Self-supervised learning has enormous untapped potential in deep learning. Where supervised learning requires huge amounts of labelled data to produce an accurate solution, self-supervised learning needs only a sliver of that labelled data (if any at all). That, of course, is also what makes it such a challenging line of work.
But self-supervised learning has been garnering attention recently, especially in computer vision (a field that notoriously requires more labelled data than most to produce good results). And now the Google AI team has developed a model that can track objects in videos without requiring any labelled data at all.
The team has designed a convolutional neural network that adds color to grayscale videos. In the process, the network teaches itself to visually track objects in the video. The team admits in a blog post that the model was never trained with the explicit aim of tracking, yet it learned to follow multiple objects, and to stay robust while doing so, without requiring ANY labelled training data!
The researchers used videos from the public Kinetics dataset to train the model. Keep in mind that all these videos are in color, so they were first converted to grayscale, except for the very first frame of each video. The convolutional network was then trained to predict the original colors in all the remaining frames. The below collection of images illustrates this technique well:
You might be wondering why they decolored the videos in the first place. The reason is that a video often contains multiple objects of the same color; by converting it to grayscale and then having the network add the color back, the team was able to teach the machine to track specific objects.
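For the curious, the paper frames the mechanism as a kind of "pointing": the network embeds every pixel of the reference (color) frame and of the grayscale target frame, and each target pixel copies its color from the reference pixels it points at, via an attention over embedding similarities. Here's a minimal PyTorch sketch of that copy-by-attention step (the function name, shapes, and the per-pixel flattening are my own simplifications, not the authors' code):

```python
import torch
import torch.nn.functional as F

def propagate_colors(ref_feats, tgt_feats, ref_colors):
    """Copy colors from a reference frame to a grayscale target frame.

    ref_feats:  (N, D) per-pixel embeddings of the reference (color) frame
    tgt_feats:  (M, D) per-pixel embeddings of the grayscale target frame
    ref_colors: (N, C) quantized color labels of the reference pixels
    """
    # How strongly each target pixel "points" at each reference pixel
    attn = F.softmax(tgt_feats @ ref_feats.t(), dim=1)   # (M, N)
    # Each target pixel's color is a weighted copy of reference colors
    return attn @ ref_colors                             # (M, C)

# Training signal: the CNN producing these embeddings is optimized so the
# copied colors match the target frame's true (quantized) colors, e.g. with
# a cross-entropy loss over discrete color bins.
```

Because the only way to copy the right colors is to point at the right pixels, the embeddings end up encoding correspondences between frames, and tracking falls out as a byproduct.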
An important part of designing and using deep learning models is interpretability, which isn't easy given the complexity associated with them. According to their blog post, the team used “a standard trick to visualize the embeddings learned by the model by projecting them down to three dimensions using Principal Component Analysis (PCA) and plotting it as an RGB movie”.
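If you want to try the same trick on your own embeddings, it boils down to a few lines with scikit-learn. Here's a hedged sketch for a single frame (the exact normalization the team used isn't stated, so the rescaling below is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

def embeddings_to_rgb(embeddings):
    """Render (H, W, D) per-pixel embeddings as an (H, W, 3) RGB image."""
    H, W, D = embeddings.shape
    flat = embeddings.reshape(-1, D)
    proj = PCA(n_components=3).fit_transform(flat)       # (H*W, 3)
    # Rescale each principal component into [0, 1] to use as a color channel
    proj -= proj.min(axis=0)
    proj /= proj.max(axis=0) + 1e-8
    return proj.reshape(H, W, 3)
```

For a movie rather than a single frame, you'd presumably fit the PCA once across all frames so the colors stay consistent over time.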
Another finding was that the model is even able to track the pose of humans. The below image shows the poses of different people being tracked (this was tested on the JHMDB dataset).
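Mechanically, pose tracking is the same pointer at work: the attention that copies colors can copy any per-pixel labels, including one-hot keypoint channels annotated in the first frame. A hedged sketch of that idea (all names, shapes, and the random stand-in data below are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

# Stand-ins: in practice the embeddings come from the trained colorization
# network, and the keypoint pixels from the first frame's annotations.
H, W, D, K = 8, 8, 16, 13                 # K = number of joints (illustrative)
ref_feats = torch.randn(H * W, D)         # embeddings of the annotated frame
tgt_feats = torch.randn(H * W, D)         # embeddings of a later frame
keypoint_pixels = torch.randint(0, H * W, (K,))

# One-hot label map: channel k marks where joint k sits in the first frame
labels = torch.zeros(H * W, K)
labels[keypoint_pixels, torch.arange(K)] = 1.0

# Exactly the same pointer used for colors, now copying joint labels
attn = F.softmax(tgt_feats @ ref_feats.t(), dim=1)   # (H*W, H*W)
heatmaps = attn @ labels                             # (H*W, K)
tracked = heatmaps.argmax(dim=0)                     # predicted pixel per joint
```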
You can read about this technique in more detail in Google’s research paper here.
If you read the paper (and you really should!) you’ll see that the results of this model don’t outperform high-end supervised models. But since this is just the starting point for self-supervised video tracking, I think we can expect that gap to shrink significantly soon.
I especially liked that the model is doing multiple things at once: colorization, pose estimation, and of course object tracking. It turns out that the model's failures are correlated with failures to colorize the video, which pinpoints exactly where the team needs to focus next. This is definitely something we should keep an eye on in the foreseeable future, as the potential and possibilities of this technique are vast.