I love reading and decoding machine learning research papers. There is so much incredible information to parse through – a goldmine for us data scientists! I was thrilled when the best papers from the peerless ICLR 2019 (International Conference on Learning Representations) conference were announced.
I couldn’t wait to get my hands on them.
However, the majority of research papers are quite difficult to understand. They are written with a specific audience in mind (fellow researchers) and hence they assume a certain level of knowledge.
I faced the same issue when I first dabbled in these research papers. I struggled mightily to parse them and grasp the underlying techniques. That’s why I decided to help my fellow data scientists understand these research papers.
There are so many incredible academic conferences happening these days and we need to keep ourselves updated with the latest machine learning developments. This article is my way of giving back to the community that has given me so much!
In this article, we’ll look at the two best papers from the ICLR 2019 conference. There is a heavy emphasis on neural networks here, so brush up on the basics of neural networks first if you need a refresher or an introduction.
Let’s break down these two incredible papers and understand their approaches.
The structure of natural language is hierarchical. This means that larger units or constituents are composed of smaller units or constituents (phrases). This structure is usually tree-like.
While the standard LSTM architecture allows different neurons to track information at different time scales, it does not have an explicit bias towards modeling a hierarchy of units. This paper proposes to add such an inductive bias by ordering the neurons.
The researchers aim to integrate a tree structure into a neural network language model. The reason behind doing this is to improve generalization via better inductive bias and at the same time, potentially reduce the need for a large amount of training data.
This is where things become really interesting (and really cool for all you nerds out there!).
This paper proposes Ordered Neurons: a new inductive bias for RNNs that forces neurons to represent information at different time-scales.
This inductive bias ensures that long-term information is stored in high-ranking neurons, while short-term information (which can be rapidly forgotten) is kept in low-ranking neurons.
A new RNN unit – ON-LSTM – is proposed. The new model uses an architecture similar to the standard LSTM.
The difference is that the update function for the cell state ct is replaced with a new function built around the cumax() activation (described below).
It might be difficult to discern a hierarchy of information between the neurons since the gates in an LSTM act independently on each neuron. So, the researchers have proposed to make the gate for each neuron dependent on the others by enforcing the order in which neurons should be updated.
Pretty fascinating, right?
ON-LSTM combines the standard LSTM with a new gating mechanism and a new activation function, cumax(). This combination is what biases the model towards performing tree-like composition operations.
I wanted to spend a few moments talking about the cumax() function. It is the key to unlocking the approach presented in this awesome paper.
This cumax() activation function is introduced to enforce an order on the update frequency of the neurons:
ĝ = cumax(…) = cumsum(softmax(…))
Here, cumsum denotes the cumulative sum. ĝ can be seen as the expectation of a binary gate g of the form (0, …, 0, 1, …, 1), which splits the cell state into two contiguous segments.
Thus, the model can apply different update rules to each segment to differentiate long-term from short-term information.
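Here is a minimal sketch of what cumax() boils down to in code, assuming a PyTorch-style tensor API (the function name and shapes are mine, not the authors’ implementation):

```python
import torch
import torch.nn.functional as F

def cumax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """cumax(x) = cumsum(softmax(x)): a soft gate whose values rise monotonically towards 1."""
    return torch.cumsum(F.softmax(x, dim=dim), dim=dim)

# The output behaves like the expectation of a binary gate (0, ..., 0, 1, ..., 1):
# everything after the (soft) split point is "open", everything before it is "closed".
logits = torch.randn(1, 8)
print(cumax(logits))
```

Because softmax outputs are non-negative and sum to one, the cumulative sum is guaranteed to increase monotonically and stay bounded by one, which is exactly the ordering property the master gates need.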
The paper also introduces a new master forget gate ft and a new master input gate it, both based on the cumax() function.
Following the properties of cumax(), the values in the master forget gate monotonically increase from 0 to 1, while the values in the master input gate monotonically decrease from 1 to 0.
These gates serve as high-level control for the update operations of cell states. We can define a new update rule using the master gates:
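To make this concrete, here is a minimal sketch of the ON-LSTM cell-state update, assuming the standard LSTM gates (forget gate, input gate, and candidate cell state) have already been computed. The variable names follow the paper’s notation, but this is an illustrative re-implementation of its equations, not the authors’ released code:

```python
import torch
import torch.nn.functional as F

def cumax(x, dim=-1):
    return torch.cumsum(F.softmax(x, dim=dim), dim=dim)

def on_lstm_cell_update(c_prev, f, i, c_hat, master_f_logits, master_i_logits):
    f_master = cumax(master_f_logits)            # master forget gate: values rise from 0 to 1
    i_master = 1.0 - cumax(master_i_logits)      # master input gate: values fall from 1 to 0
    overlap = f_master * i_master                # segment where both master gates are active
    f_hat = f * overlap + (f_master - overlap)   # combined forget gate
    i_hat = i * overlap + (i_master - overlap)   # combined input gate
    return f_hat * c_prev + i_hat * c_hat        # new cell state c_t
```

Intuitively, high-ranking neurons (where the master forget gate is close to 1) keep their contents for long spans, while low-ranking neurons (where the master input gate dominates) are overwritten frequently, which is exactly the ordered behaviour described above.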
The researchers evaluated their model on four tasks: language modeling, unsupervised constituency parsing, targeted syntactic evaluation, and logical inference.
Here are the final results:
The horizontal axis indicates the length of the sequence and the vertical axis indicates the accuracy of the models on the corresponding test set. The ON-LSTM model gives an impressive performance on sequences longer than 3.
The ON-LSTM model shows better generalization while facing structured data with various lengths. A tree-structured model can achieve quite a strong performance on this dataset.
This next paper, The Lottery Ticket Hypothesis, is one of my favorite papers of 2019. Let’s break it down into easy-to-digest sections!
Pruning is the process of removing unnecessary weights from a neural network. It can potentially reduce the parameter count by more than 90% without affecting accuracy, which decreases the size and energy consumption of the trained network and makes inference more efficient.
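As a quick illustration of magnitude-based pruning, here is a tiny sketch (the weight matrix is random, just to show the mechanics):

```python
import torch

w = torch.randn(256, 256)                 # stand-in for a trained weight matrix
threshold = w.abs().quantile(0.9)         # 90th percentile of weight magnitudes
mask = (w.abs() >= threshold).float()     # keep only the largest 10% of weights
pruned_w = w * mask                       # zero out everything else
print(f"{(mask == 0).float().mean():.0%} of the weights removed")
```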
However, if a network can be reduced in size, why don’t we train this smaller architecture instead to make training more efficient?
This is because the architectures uncovered by pruning are harder to train from scratch, and training them from the beginning brings the accuracy down significantly.
This paper aims to show that there exist smaller sub-networks that can be trained from scratch. These sub-networks learn at least as fast as their larger counterparts while achieving similar test accuracy.
For instance, the researchers randomly sample and train sub-networks from a fully-connected network for MNIST and from convolutional networks for CIFAR10:
The dashed lines trace the iteration of minimum validation loss and the test accuracy at that iteration across various levels of sparsity. The sparser the network, the slower the learning and the lower the eventual test accuracy.
This is where the researchers have stated their lottery ticket hypothesis.
A randomly-initialized, dense neural network contains a sub-network (a “winning ticket”) that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations.
Here is a superb illustration of the lottery ticket hypothesis concept:
We identify a winning ticket by training the network and pruning its smallest-magnitude weights. The remaining, unpruned connections constitute the architecture of the winning ticket.
Each unpruned connection’s value is then reset to its initialization from the original network before it was trained.
The process for finding a winning ticket involves an iterative loop of smart training and pruning. I have summarized it in five steps:
1. Randomly initialize a neural network.
2. Train the network until it converges.
3. Prune a fraction of the network by removing the smallest-magnitude weights.
4. Reset the weights of the remaining, unpruned connections to their values from step 1. This sub-network is the winning ticket.
5. Train the pruned, reset sub-network in isolation and check whether it matches the test accuracy of the original network.
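Here is a minimal, hypothetical sketch of these steps in PyTorch. The helper name find_winning_ticket() and the train_fn placeholder are mine; this illustrates the procedure rather than reproducing the authors’ reference implementation:

```python
import copy
import torch

def find_winning_ticket(model, train_fn, prune_fraction=0.9):
    theta_0 = copy.deepcopy(model.state_dict())        # step 1: remember the random initialization
    train_fn(model)                                    # step 2: train the network to convergence
    masks = {}
    with torch.no_grad():
        for name, w in model.named_parameters():       # step 3: prune the smallest-magnitude weights
            k = max(1, int(w.numel() * prune_fraction))
            threshold = w.abs().flatten().kthvalue(k).values
            masks[name] = (w.abs() > threshold).float()
        model.load_state_dict(theta_0)                 # step 4: reset survivors to their initial values
        for name, w in model.named_parameters():
            w.mul_(masks[name])                        # zero out the pruned connections
    return model, masks                                # step 5: retrain this sub-network in isolation
```

Note that this sketch prunes every layer at the same rate; the paper tweaks the rate for the output layer, but the overall recipe is the same.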
In its simplest form, the pruning is one-shot, which means it is done only once.
But in this paper, the researchers focus on iterative pruning, which repeatedly trains, prunes, and resets the network over n rounds. Each round prunes p^(1/n)% of the weights that survive the previous round.
As a result, this iterative pruning finds winning tickets that match the accuracy of the original network at smaller sizes as compared to one-shot pruning.
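To get a feel for how per-round pruning compounds, here is a tiny back-of-the-envelope loop. It assumes the 20% per-round rate used in most of the paper’s experiments and only illustrates the compounding, not the paper’s exact p^(1/n) schedule:

```python
per_round = 0.20            # fraction of surviving weights pruned in each round (assumed rate)
remaining = 1.0
for round_idx in range(1, 6):
    remaining *= (1.0 - per_round)
    print(f"after round {round_idx}: {remaining:.1%} of the original weights remain")
# After 5 rounds roughly 32.8% of the weights survive, i.e. about 67% are pruned.
```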
A question that comes to everyone’s mind when reading these research papers – where in the world can we apply this? It’s all well and good experimenting and coming up trumps with a new approach. But the jackpot is in converting it into a real-world application.
This paper can be super useful for figuring out such winning tickets. The researchers apply the lottery ticket hypothesis to fully-connected networks trained on MNIST and to convolutional networks trained on CIFAR10, increasing both the complexity of the learning problem and the size of the network.
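For a sense of scale, here is the fully-connected LeNet-300-100 architecture the paper uses for its MNIST experiments, together with how many parameters would survive at a few pruning levels (the printout is plain arithmetic, not an experimental result):

```python
import torch.nn as nn

# LeNet-300-100: two hidden layers of 300 and 100 units, as in the paper's MNIST experiments
mnist_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

total = sum(p.numel() for p in mnist_net.parameters())
for sparsity in (0.5, 0.8, 0.9, 0.95):
    kept = int(total * (1 - sparsity))
    print(f"prune {sparsity:.0%}: roughly {kept:,} of {total:,} parameters remain")
```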
Existing work on neural network pruning demonstrates that the functions learned by a neural network can often be represented with fewer parameters. Pruning typically proceeds by training the original network, removing connections, and further fine-tuning.
In effect, the initial training initializes the weights of the pruned network so that it can learn in isolation during fine-tuning.
A winning ticket learns more slowly and achieves lower test accuracy when it is randomly reinitialized. This suggests that initialization is important to its success.
The initialization that gives rise to a winning ticket is arranged in a particular sparse architecture. Since we uncover winning tickets through heavy use of training data, we hypothesize that the structure of our winning tickets encodes an inductive bias customized to the learning task at hand.
The researchers are aware that this isn’t the finished product yet. The current approach has a few limitations which could be addressed in the future: iterative pruning is computationally expensive since the network has to be trained many times in a row, and the experiments are limited to smaller vision datasets like MNIST and CIFAR10 rather than ImageNet-scale problems.
In this article, we thoroughly discussed the two best research papers from ICLR 2019. I learned so much while going through these papers and understanding the thought process of these research experts. I encourage you to go through the papers yourself once you finish this article.
There are a couple more research-focused conferences coming up soon. The International Conference on Machine Learning (ICML) and the Computer Vision and Pattern Recognition (CVPR) conferences are lined up in the coming months. Stay tuned!