The power of artificial intelligence is hard to overstate. Robots have already reached the testing phase in some of the world's leading countries, and governments and large companies are spending billions on developing these ultra-intelligent machines. These recent developments have drawn the attention of research houses across the world.
Does it excite you as well? For me, learning about robots and developments in AI started with deep curiosity and excitement. Let’s learn about computer vision today.
The earliest research in computer vision started way back in the 1950s. Since then, we have come a long way but still find ourselves far from the ultimate objective. With neural networks and deep learning, however, we are empowered like never before.
Applications of deep learning in vision have taken this technology to a different level and made sophisticated things like self-driving cars possible in the near future. In this article, I will also introduce you to Convolutional Neural Networks (CNNs), which form the crux of deep learning applications in computer vision.
Note: This article is inspired by Stanford’s Class on Visual Recognition. Understanding this article requires prior knowledge of Neural Networks. If you are new to neural networks, you can start here. Another useful resource on the basics of deep learning can be found here.
You can also learn Convolutional Neural Networks in a structured and comprehensive manner by enrolling in this free course: Convolutional Neural Networks (CNN) from Scratch
As the name suggests, the aim of computer vision (CV) is to imitate the functionality of the human eye and the brain components responsible for our sense of sight.
Tasks such as recognizing an animal, describing a view, or differentiating among visible objects are a cakewalk for humans. You’d be surprised to know that it took decades of research to give a computer the ability to detect an object with reasonable accuracy.
The field of computer vision has witnessed continual advancements in the past 5 years. One of the most notable is the Convolutional Neural Network (CNN). Today, deep CNNs form the crux of the most sophisticated computer vision applications, such as self-driving cars, auto-tagging of friends in our Facebook pictures, facial security features, gesture recognition, automatic number plate recognition, etc.
Let’s get familiar with it a bit more:
Object detection is considered the most basic application of computer vision; most other developments in the field are achieved by building small enhancements on top of it. In real life, every time we humans open our eyes, we unconsciously detect objects.
Since it is super-intuitive for us, we fail to appreciate the key challenges involved when we try to design systems similar to our eye. Let’s start by looking at some of the key roadblocks:
These are just some of the challenges I brought up so that you can appreciate the complexity of the tasks which your eye and brain duo performs with such utter ease. Breaking these challenges up and solving each one individually is possible in computer vision today, but we are still decades away from a system that can get anywhere close to the human eye (which can do everything!).
This brilliance of the human body is the reason researchers have been trying to crack the enigma of computer vision by analyzing the visual mechanics of humans and other animals. Some of the earliest work in this direction was done by Hubel and Wiesel with their famous cat experiment in 1959. Read more about it here.
This was the first study which emphasized the importance of edge detection for solving the computer vision problem. They were awarded the Nobel Prize for their work.
Before diving into convolutional neural networks, let’s take a quick look at the traditional, or rather elementary, techniques used in computer vision before deep learning became popular.
Various techniques other than deep learning are available for computer vision. They work well for simpler problems, but as the data grow huge and the task becomes complex, they are no substitute for deep CNNs. Let’s briefly discuss two simple approaches.
I hope this gives some intuition into the challenges faced by approaches other than deep learning. Please note that more sophisticated techniques than the ones discussed above can be used, but they would rarely beat a deep learning model.
Let’s discuss some properties of neural networks. I will skip the basics here, as I have already covered them in my previous article – Fundamentals of Deep Learning – Starting with Neural Networks.
Once your fundamentals are sorted, let’s learn in detail about some important concepts such as activation functions, data preprocessing, weight initialization and dropout.
There are various activation functions which can be used and this is an active area of research. Let’s discuss some of the popular options:
To summarize, ReLU is usually the activation function of choice. If its caveats are kept in mind, it can be used very effectively.
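To make this concrete, here is a minimal numpy sketch (not tied to any particular framework) of ReLU and its leaky variant:

import numpy as np

def relu(x):
    # ReLU: pass positive values through, clamp negative values to zero
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: keep a small slope for negative inputs so neurons never go fully "dead"
    return np.where(x > 0, x, alpha * x)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))         # negatives become 0
print(leaky_relu(z))   # negatives are scaled by alpha instead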
For images, generally the following preprocessing steps are done:
Note that variance normalization is generally not done for images, since pixel values are already on a comparable scale (0–255).
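As a quick illustration, here is a minimal sketch of the most common preprocessing step, mean subtraction, using a toy numpy array as a stand-in for the image data:

import numpy as np

# Toy stand-in for a batch of images: (num_images, height, width, channels)
X_train = np.random.randint(0, 256, size=(500, 32, 32, 3)).astype('float32')

# Mean subtraction: centre the data around zero using the training-set mean
mean_image = X_train.mean(axis=0)   # per-pixel mean, shape (32, 32, 3)
X_train -= mean_image

# The same training mean is reused for test images (never recomputed on test data)
X_test = np.random.randint(0, 256, size=(100, 32, 32, 3)).astype('float32')
X_test -= mean_image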
There are various techniques for initializing weights. Let’s consider a few of them:
One more thing to remember when using ReLU as the activation function: the weight initialization might be such that some neurons never get activated because their input is always negative. This is something worth checking for. You might be surprised to know that 10-20% of ReLUs can be dead at a given point during training, and even at the end of it.
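For intuition, here is a small sketch of two initialization schemes, a naive small-Gaussian one and the He scheme often recommended for ReLU units, along with a rough check for dead units (the layer sizes here are purely illustrative):

import numpy as np

fan_in, fan_out = 3072, 100   # e.g. a flattened 32x32x3 image feeding 100 neurons

# Naive approach: small random Gaussian weights
W_small = 0.01 * np.random.randn(fan_in, fan_out)

# He initialization, usually preferred with ReLU units:
# scale by sqrt(2 / fan_in) so activations keep a healthy variance layer after layer
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

# Quick sanity check for "dead" ReLUs on a batch of centred inputs
X = np.random.randn(256, fan_in)
activations = np.maximum(0, X.dot(W_he))
dead_fraction = np.mean(np.all(activations == 0, axis=0))
print(dead_fraction)   # fraction of neurons that never fired on this batch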
These were just some of the important concepts. Others, such as batch normalization, stochastic gradient descent and dropout, also matter, and I encourage you to read about them on your own.
Before going into the details, let’s first try to get some intuition into why deep networks work better.
As we learned from the drawbacks of the earlier approaches, they are unable to cater to the vast amount of variation in images. Deep CNNs work by consecutively modeling small pieces of information and combining them deeper in the network.
One way to understand this is that the first layer tries to detect edges and forms templates for edge detection. Subsequent layers then combine these into simple shapes and eventually into templates covering different object positions, illuminations, scales, etc. The final layers match an input image against all the templates, and the final prediction is like a weighted sum over all of them. Deep CNNs are thus able to model complex variations and behaviour, giving highly accurate predictions.
There is an interesting paper on visualization of deep features in CNNs which you can go through to get more intuition – Understanding Neural Networks Through Deep Visualization.
For explaining CNNs and finally showing an example, I will be using the CIFAR-10 dataset, which you can download from here. This dataset has 60,000 images across 10 labels, with 6,000 images of each type. Each image is colored and 32×32 in size.
A CNN typically consists of 3 types of layers:
You might also find normalization layers (for example, the local response normalization used in older CNNs such as AlexNet), but these are rarely used these days. We’ll consider these one by one.
Since convolution layers form the crux of the network, I’ll consider them first. Each layer can be visualized in the form of a block or a cuboid. For instance, in the case of the CIFAR-10 data, the input layer would have the following form:
Here you can see the original image, which is 32×32 in height and width. The depth is 3, corresponding to the Red, Green and Blue channels that form the basis of colored images. A convolution layer is formed by running a filter over it. A filter is another block or cuboid of smaller height and width but the same depth, which is swept over this base block. Let’s consider a filter of size 5x5x3.
We start this filter from the top left corner and sweep it until the bottom right corner. The filter is nothing but a set of weights, i.e. 5x5x3 = 75 weights + 1 bias = 76 parameters. At each position, the weighted sum of the pixels is calculated as W^T X + b and a new value is obtained. A single filter results in a volume of size 28x28x1 as shown above.
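As a quick sanity check of the parameter count and the weighted sum, here is a tiny numpy sketch of a single filter applied at one position (random values, purely illustrative):

import numpy as np

# One 5x5x3 filter has 75 weights plus a bias term
W = np.random.randn(5, 5, 3)
b = np.random.randn()

# A 5x5x3 patch of the input image at the filter's current position
patch = np.random.randn(5, 5, 3)

# The filter response at this position is a single number: sum(W * patch) + b
response = np.sum(W * patch) + b
print(W.size + 1)   # 76 parameters
print(response)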
Note that multiple filters are generally applied in each convolution layer. Therefore, if 10 filters are used, the output would look like:
Here the filter weights are parameters which are learned during the back-propagation step. You might have noticed that we got a 28×28 block as output when the input was 32×32. Why so? Let’s look at a simpler case.
Suppose the initial image has size 6x6xd and the filter has size 3x3xd. I’ve kept the depth as d because it can be anything; it is immaterial here as it remains the same in both. Since the depth is the same, we can look at the front view of how the filter would work:
Here we can see that the result would be a 4x4x1 volume block. Notice that there is a single output over the entire depth at each location of the filter. But you need not do this visualization every time. Let’s define a generic case where the image has dimensions NxNxd and the filter FxFxd. Let’s also define another term, stride (S), which is the number of cells (in the above matrix) the filter moves in each step. In the above case we had a stride of 1, but it can be a higher value as well. The size of the output will then be:
output size = (N – F)/S + 1
You can validate the first case, where N=32, F=5, S=1. The output had 28 pixels, which is exactly what this formula gives. Please note that some values of S result in a non-integer output size, and we generally don’t use such values.
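The formula is easy to encode and check; here is a small illustrative helper that reproduces both results seen so far:

def conv_output_size(N, F, S):
    # (N - F) / S + 1 ; valid only when (N - F) is divisible by S
    assert (N - F) % S == 0, "stride does not fit this image/filter combination"
    return (N - F) // S + 1

print(conv_output_size(32, 5, 1))   # 28, matching the CIFAR-10 example above
print(conv_output_size(6, 3, 1))    # 4, matching the 6x6 example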
Let’s consider an example to consolidate our understanding. Starting with the same 32×32 image as before, we apply two convolution layers consecutively: first 10 filters of size 7 with stride 1, and then 6 filters of size 5 with stride 2. Before looking at the solution below, just think about 2 things:
Here is the answer:
Notice that the size of the image shrinks with each layer. This is undesirable in deep networks, where the size would become very small too early. It also restricts the use of large filters, as they would shrink the size even faster.
To prevent this, we generally use a stride of 1 along with zero-padding of size (F-1)/2. Zero-padding is nothing but adding additional zero-valued pixels around the border of the image.
Consider the example we saw above with a 6×6 image and a 3×3 filter. The required padding is (3-1)/2=1. We can visualize the padding as:
Here you can see that the image becomes 8×8 because of a padding of 1 on each side. The output will now be of size 6×6, the same as the original image.
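With padding P, the output-size formula extends to (N – F + 2P)/S + 1. A quick check, continuing the illustrative helper above:

def conv_output_size_padded(N, F, S, P):
    # With zero-padding P on each side the formula becomes (N - F + 2P) / S + 1
    return (N - F + 2 * P) // S + 1

# 6x6 image, 3x3 filter, stride 1, padding (F-1)/2 = 1 -> output stays 6x6
print(conv_output_size_padded(6, 3, 1, 1))    # 6

# 32x32 image, 5x5 filter, stride 1, padding 2 -> output stays 32x32
print(conv_output_size_padded(32, 5, 1, 2))   # 32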
Now let’s summarize a convolution layer as follows:
Some additional points to be taken into consideration:
Having understood the convolution layer, let’s move on to the pooling layer.
When we use padding in the convolution layer, the image size remains the same, so pooling layers are used to reduce the size of the image. They work by downsampling each activation map with a small filter. Consider the following 4×4 layer. If we use a 2×2 filter with stride 2 and max-pooling, we get the following response:
Here you can see that each of the four 2×2 blocks is reduced to a single value, its maximum. Generally max-pooling is used, but other options like average pooling can be considered.
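Here is a tiny numpy sketch of exactly this 2×2, stride-2 max-pooling operation on a sample 4×4 map:

import numpy as np

# A 4x4 activation map
x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 9, 1],
              [3, 8, 4, 6]])

# 2x2 max-pooling with stride 2: reshape into 2x2 blocks and take the max of each
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 5]
#  [8 9]]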
After the convolution and pooling layers, networks generally use fully-connected layers, in which each pixel is treated as a separate neuron, just like in a regular neural network. The last fully-connected layer contains as many neurons as there are classes to predict. For instance, in the CIFAR-10 case, the last fully-connected layer will have 10 neurons.
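To see how these three layer types come together in code, here is a minimal, illustrative sketch in Keras (one of the alternative tools mentioned later, assuming a recent version is installed; this is not the network built later in this article) for CIFAR-10-sized inputs:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Convolution layer: 32 filters of 3x3, zero-padded so the spatial size is preserved
    Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),                 # pooling layer: 32x32 -> 16x16
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),                 # 16x16 -> 8x8
    Flatten(),
    Dense(64, activation='relu'),         # fully-connected layer
    Dense(10, activation='softmax')       # one output neuron per CIFAR-10 class
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

Note that each convolution here uses padding='same', mirroring the zero-padding discussion above, so only the pooling layers shrink the spatial size.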
I recommend reading the prior section multiple times and getting the hang of the concepts before moving forward.
In this section, I will discuss the AlexNet architecture in detail. To give you some background, AlexNet was the winning solution of the ImageNet Challenge 2012. This is one of the most reputed computer vision challenges, and 2012 was the first time that a deep learning network was used for solving this problem.
It also gave significantly better results than previous solutions. I will share the network architecture here and use it to review the concepts learned above.
The detailed solution is explained in this paper. I will describe the overall architecture of the network here. AlexNet consists of an 11-layer CNN with the following architecture:
Here you can see 11 layers between input and output. Let’s discuss each of them individually, keeping in mind that the output of each layer is the input of the next.
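For reference, the stack described in the original paper is roughly: conv1 (96 filters of 11×11, stride 4), max-pooling, conv2 (256 filters of 5×5), max-pooling, conv3 (384 filters of 3×3), conv4 (384 filters of 3×3), conv5 (256 filters of 3×3), max-pooling, and then three fully-connected layers of 4096, 4096 and 1000 neurons (one per ImageNet class), which accounts for the 11 layers counted above.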
I understand this is a complicated structure, but once you understand the layers it will give you a much better grasp of the architecture. Note that you will find a different representation of the structure if you look at the AlexNet paper. This is because GPUs were not very powerful at that time, so the authors used 2 GPUs for training the network and the processing work was divided between the two.
I highly encourage you to go through the other advanced solutions to ImageNet challenges after 2012 to get more ideas of how people design these networks. Some of the interesting solutions are:
This video gives a brief overview and comparison of these solutions towards the end.
Having understood the theoretical concepts, let’s move on to the fun (practical) part and build a basic CNN on the CIFAR-10 dataset which we downloaded earlier.
I’ll be using GraphLab for running the algorithms. You are free to use alternative tools such as Torch, Theano, Keras, Caffe, TensorFlow, etc. instead. But GraphLab allows a quick and dirty implementation, as it takes care of the weight initialization and network architecture on its own.
We’ll work on the CIFAR-10 dataset which you can download from here. The first step is to load the data. This data is packed in a specific format which can be loaded using the following code:
import pandas as pd
import numpy as np
import cPickle

#Define a function to load each batch as dictionary:
def unpickle(file):
    fo = open(file, 'rb')
    dict = cPickle.load(fo)
    fo.close()
    return dict

#Make dictionaries by calling the above function:
batch1 = unpickle('data/data_batch_1')
batch2 = unpickle('data/data_batch_2')
batch3 = unpickle('data/data_batch_3')
batch4 = unpickle('data/data_batch_4')
batch5 = unpickle('data/data_batch_5')
batch_test = unpickle('data/test_batch')

#Define a function to convert this dictionary into dataframe with image pixel array and labels:
def get_dataframe(batch):
    df = pd.DataFrame(batch['data'])
    df['image'] = df.as_matrix().tolist()
    df.drop(range(3072), axis=1, inplace=True)
    df['label'] = batch['labels']
    return df

#Define train and test files:
train = pd.concat([get_dataframe(batch1), get_dataframe(batch2), get_dataframe(batch3),
                   get_dataframe(batch4), get_dataframe(batch5)], ignore_index=True)
test = get_dataframe(batch_test)
We can verify this data by looking at the head and shape of the dataframes as follows:
print train.head()
print train.shape, test.shape
Since we’ll be using GraphLab, the next step is to convert this into a GraphLab SFrame and run a neural network. Let’s convert the data first:
import graphlab as gl

gltrain = gl.SFrame(train)
gltest = gl.SFrame(test)
GraphLab has the ability to automatically create a neural network based on the data. Let’s run that as a baseline model before going into a more advanced one.
model = gl.neuralnet_classifier.create(gltrain, target='label', validation_set=None)
Here it used a simple fully-connected network with 2 hidden layers of 10 neurons each. Let’s evaluate this model on the test data.
model.evaluate(gltest)
As you can see, we get a pretty low accuracy of ~15%. This is because it is a very basic network. Let’s try to make a CNN now. But if we go about training a deep CNN from scratch, we will face the following challenges:
To overcome these challenges, we can use pre-trained networks. These are networks like AlexNet which have been pre-trained on many images, so the weights of the deep layers have already been determined. The only challenge is to find a pre-trained network that has been trained on images similar to the ones we want to classify. If the pre-trained network was not built on images from a similar domain, the features will not make much sense and the classifier will not reach high accuracy.
Before proceeding further, we need to convert our images to the size used by the ImageNet model we’re using for feature extraction. The GraphLab model is based on 256×256 images, so we need to convert our images to that size. Let’s do it using the following code:
#Convert pixels to graphlab image format
gltrain['glimage'] = gl.SArray(gltrain['image']).pixel_array_to_image(32, 32, 3, allow_rounding=True)
gltest['glimage'] = gl.SArray(gltest['image']).pixel_array_to_image(32, 32, 3, allow_rounding=True)

#Remove the original column
gltrain.remove_column('image')
gltest.remove_column('image')
gltrain.head()
Here we can see that a new column of type graphlab image has been created, but the images are still 32×32 in size. So we convert them to 256×256 using the following code:
#Convert into 256x256 size
gltrain['image'] = gl.image_analysis.resize(gltrain['glimage'], 256, 256, 3)
gltest['image'] = gl.image_analysis.resize(gltest['glimage'], 256, 256, 3)

#Remove old column:
gltrain.remove_column('glimage')
gltest.remove_column('glimage')
gltrain.head()
Now we can see that the images have been converted to the desired size. Next, we will load the ImageNet pre-trained model in GraphLab, feed the features created by its last layer into a simple classifier and make predictions.
Let’s start by loading the pre-trained model.
#Load the pre-trained model:
pretrained_model = gl.load_model('http://s3.amazonaws.com/GraphLab-Datasets/deeplearning/imagenet_model_iter45')
Now we have to use this model to extract features which will be passed into a classifier. Note that the following operations may take a lot of computing time. I use a MacBook Pro 15″ and I had to leave it running for a whole night!
gltrain['features'] = pretrained_model.extract_features(gltrain)
gltest['features'] = pretrained_model.extract_features(gltest)
Let’s have a look at the data to make sure we have the features:
gltrain.head()
Though we now have the features, notice that a lot of them are zero. You can understand this as a consequence of the smaller dataset: ImageNet was built on 1.2 million images, so many features learned from those images simply don’t make sense for this data and therefore come out as zero.
Now let’s create a classifier using GraphLab. The advantage of the “classifier” function is that it automatically creates various classifiers and chooses the best model.
simple_classifier = gl.classifier.create(gltrain, features=['features'], target='label')
The various outputs are:
The final model selection is based on a validation set with 5% of the data. The results are:
So we can see that the Boosted Trees Classifier has been chosen as the final model. Let’s look at the results on the test data:
simple_classifier.evaluate(gltest)
We can see that the test accuracy is now ~50%. It’s a decent jump from 15% to 50%, but there is still huge potential to do better. The idea here was to get you started, so I will skip the next steps. Here are some things you can try:
You can find many open-source solutions for this dataset which give >95% accuracy. You should check those out. Please feel free to try them and post your solutions in the comments below.
Now, it’s time to take the plunge and actually play with some other real datasets. So are you ready to take on the challenge? Accelerate your deep learning journey with the following Practice Problems:
Practice Problem: Identify the Apparels | Identify the type of apparel for given images
Practice Problem: Identify the Digits | Identify the digit in given images
In this article, we covered the basics of computer vision using deep Convolutional Neural Networks (CNNs). We started by appreciating the challenges involved in designing artificial systems which mimic the eye. Then, we looked at some of the traditional techniques, prior to deep learning, and got some intuition into their drawbacks.
We moved on to understanding some aspects of tuning neural networks, such as activation functions, weight initialization and data preprocessing. Next, we got some intuition into why deep CNNs should work better than traditional approaches, and we understood the different elements present in a general deep CNN.
Subsequently, we consolidated our understanding by analyzing the architecture of AlexNet, the winning solution of the ImageNet 2012 challenge. Finally, we took the CIFAR-10 data and built a classifier using features extracted from a pre-trained ImageNet deep network.
I hope you liked this article. Did you find it useful? Please feel free to share your feedback through the comments below. And to gain expertise in working with neural networks, try out the deep learning practice problem – Identify the Digits.
Thanks, it's a great article. I just had one doubt: in the example where you apply two consecutive convolutional filters (first 10 filters of size 7, stride 1 and next 6 filters of size 5, stride 2) on the image of size 32x32x3, the diagram shows 25x25x10 after the first set of filters, but shouldn't the activation map be 26x26x10? Is there a mistake in my understanding?
Hi Phil, Thanks for reaching out. Yes you're correct. I guess there is some mix-up with graphics guys. I'll fix it. :)
Can u help us building a Gender classifier??
I'm myself pretty new to this. But I can try to help you guys. Please feel free to connect with me at [email protected]
Hi Aarshay! Great article, thank you for posting! Where could I find a few pre trained CNNs for me to try out a few on my data? Thank you!
Hi Nico, Different packages provide pre-trained networks. For instance in this article I have used an AlexNet pre-trained network available in graphlab. If you're working on a vision problem, Caffe is a very good option. You'll find pre defined pre-trained networks there.