It’s a Record-Breaking Crowd! A Must-Read Tutorial to Build your First Crowd Counting Model using Deep Learning

Pulkit Sharma Last Updated : 04 Jan, 2021
9 min read

Introduction

Artificial Intelligence and Machine Learning is going to be our biggest helper in coming decade!

Today morning, I was reading an article which reported that an AI system won against 20 lawyers and the lawyers were actually happy that AI can take care of repetitive part of their roles and help them work on complex topics. These lawyers were happy that AI will enable them to have more fulfilling roles.

Today, I will be sharing a similar example – How to count number of people in crowd using Deep Learning and Computer Vision? But, before we do that – let us develop a sense of how easy the life is for a Crowd Counting Scientist.

 

Act like a Crowd Counting Scientist

Let’s start!

Can you help me count / estimate number of people in this picture attending this event?

Ok – how about this one?

Crowd image

Source: ShanghaiTech Dataset

You get the hang of it. By end of this tutorial, we will create an algorithm for Crowd Counting with an amazing accuracy (compared to humans like you and me). Will you use such an assistant?

 

P.S. This article assumes that you have a basic knowledge of how convolutional neural networks (CNNs) work. You can refer to the below post to learn about this topic before you proceed further:

 

Table of Contents

  1. What is Crowd Counting?
  2. Why is Crowd Counting required?
  3. Understanding the Different Computer Vision Techniques for Crowd Counting
  4. The Architecture and Training Methods of CSRNet
  5. Building your own Crowd Counting model in Python

This article is highly inspired by the paper – CSRNet : Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes.

 

What is Crowd Counting?

Crowd Counting is a technique to count or estimate the number of people in an image. Take a moment to analyze the below image:

Crowd image

Source: ShanghaiTech Dataset

Can you give me an approximate number of how many people are in the frame? Yes, including the ones present way in the background. The most direct method is to manually count each person but does that make practical sense? It’s nearly impossible when the crowd is this big!

Crowd scientists (yes, that’s a real job title!) count the number of people in certain parts of an image and then extrapolate to come up with an estimate. More commonly, we have had to rely on crude metrics to estimate this number for decades.

Surely there must be a better, more exact approach?

Yes, there is!

While we don’t yet have algorithms that can give us the EXACT number, most computer vision techniques can produce impressively precise estimates. Let’s first understand why crowd counting is important before diving into the algorithm behind it.

 

Why is Crowd Counting useful?

Let’s understand the usefulness of crowd counting using an example. Picture this – your company just finished hosting a huge data science conference. Plenty of different sessions took place during the event.

You are asked to analyze and estimate the number of people who attended each session. This will help your team understand what kind of sessions attracted the biggest crowds (and which ones failed in that regard). This will shape next year’s conference, so it’s an important task!

Image result for conference

There were hundreds of people at the event – counting them manually will take days! That’s where your data scientist skills kick in. You managed to get photos of the crowd from each session and build a computer vision model to do the rest!

There are plenty of other scenarios where crowd counting algorithms are changing the way industries work:

  • Counting the number of people attending a sporting event
  • Estimating how many people attended an inauguration or a march (political rallies, perhaps)
  • Monitoring of high-traffic areas
  • Helping with staffing allocation and resource allotment

Can you come up with some other use cases? Let me know in the comments section below! We can connect and try to figure out how we can use crowd counting techniques in your scenario.

 

Understanding the Different Computer Vision Techniques for Crowd Counting

Broadly speaking, there are currently four methods we can use for counting the number of people in a crowd:

1. Detection-based methods

Here, we use a moving window-like detector to identify people in an image and count how many there are. The methods used for detection require well trained classifiers that can extract low-level features. Although these methods work well for detecting faces, they do not perform well on crowded images as most of the target objects are not clearly visible.

2. Regression-based methods

We were unable to extract low level features using the above approach. Regression-based methods come up trumps here. We first crop patches from the image and then, for each patch, extract the low level features.

3. Density estimation-based methods

We first create a density map for the objects. Then, the algorithm learn a linear mapping between the extracted features and their object density maps. We can also use random forest regression to learn non-linear mapping.

4. CNN-based methods

Ah, good old reliable convolutional neural networks (CNNs). Instead of looking at the patches of an image, we build an end-to-end regression method using CNNs. This takes the entire image as input and directly generates the crowd count. CNNs work really well with regression or classification tasks, and they have also proved their worth in generating density maps.

CSRNet, a technique we will implement in this article, deploys a deeper CNN for capturing high-level features and generating high-quality density maps without expanding the network complexity. Let’s understand what CSRNet is before jumping to the coding section.

 

Understanding the Architecture and Training Method of CSRNet

CSRNet uses VGG-16 as the front end because of its strong transfer learning ability. The output size from VGG is ⅛th of the original input size. CSRNet also uses dilated convolutional layers in the back end.

But what in the world are dilated convolutions? It’s a fair question to ask. Consider the below image:

Dilated Convolutions

The basic concept of using dilated convolutions is to enlarge the kernel without increasing the parameters. So, if the dilation rate is 1, we take the kernel and convolve it on the entire image. Whereas, if we increase the dilation rate to 2, the kernel extends as shown in the above image (follow the labels below each image). It can be an alternative to pooling layers.

 

Underlying Mathematics (Recommended, but optional)

I’m going to take a moment to explain how the mathematics work. Note that this isn’t mandatory to implement the algorithm in Python, but I highly recommend learning the underlying idea. This will come in handy when you need to tweak or modify your model.

Suppose we have an input x(m,n), a filter w(i,j), and the dilation rate r. The output y(m,n) will be:

Dilation output

We can generalize this equation using a (k*k) kernel with a dilation rate r. The kernel enlarges to:

([k + (k-1)*(r-1)] * [k + (k-1)*(r-1)])

So the ground truth has been generated for each image. Each person’s head in a given image is blurred using a Gaussian kernel. All the images are cropped into 9 patches, and the size of each patch is ¼th of the original size of the image. With me so far?

The first 4 patches are divided into 4 quarters and the other 5 patches are randomly cropped. Finally, the mirror of each patch is taken to double the training set.

That, in a nutshell, are the architecture details behind CSRNet. Next, we’ll look at its training details, including the evaluation metric used.

Stochastic Gradient Descent is used to train the CSRNet as an end-to-end structure. During training, the fixed learning rate is set to 1e-6. The loss function is taken to be the Euclidean distance in order to measure the difference between the ground truth and estimated density map. This is represented as:

Loss function

where N is the size of the training batch. The evaluation metric used in CSRNet is MAE and MSE, i.e., Mean Absolute Error and Mean Square Error. These are given by:

MAE and MSE

Here, Ci is the estimated count:

Estimated count

L and W are the width of the predicted density map.

Our model will first predict the density map for a given image. The pixel value will be 0 if no person is present. A certain pre-defined value will be assigned if that pixel corresponds to a person. So, calculating the total pixel values corresponding to a person will give us the count of people in that image. Awesome, right?

And now, ladies and gentlemen, it’s time to finally build our own crowd counting model!

 

Building your own Crowd Counting model

Ready with your notebook powered up?

We will implement CSRNet on the ShanghaiTech dataset. This contains 1198 annotated images of a combined total of 330,165 people. You can download the dataset from here.

Use the below code block to clone the CSRNet-pytorch repository. This holds the entire code for creating the dataset, training the model and validating the results:

git clone https://github.com/leeyeehoo/CSRNet-pytorch.git

Please install CUDA and PyTorch before you proceed further. These are the backbone behind the code we’ll be using below.

Now, move the dataset into the repository you cloned above and unzip it. We’ll then need to create the ground truth values. The make_dataset.ipynb file is our savior. We just need to make minor changes in that notebook:

 

 

#setting the root to the Shanghai dataset you have downloaded
# change the root path as per your location of dataset
root = '/home/pulkit/CSRNet-pytorch/'

Now, let’s generate the ground truth values for images in part_A and part_B:

 

Generating the density map for each image is a time taking step. So go brew a cup of coffee while the code runs.

So far, we have generated the ground truth values for images in part_A. We will do the same for the part_B images. But before that, let’s see a sample image and plot its ground truth heatmap:

plt.imshow(Image.open(img_paths[0]))

ShanghaiTech part_A train image

Things are getting interesting!

gt_file = h5py.File(img_paths[0].replace('.jpg','.h5').replace('images','ground-truth'),'r')
groundtruth = np.asarray(gt_file['density'])
plt.imshow(groundtruth,cmap=CM.jet)

Let’s count how many people are present in this image:

np.sum(groundtruth)

270.32568

Similarly, we will generate values for part_B:

Now, we have the images as well as their corresponding ground truth values. Time to train our model!

We will use the .json files available in the cloned directory. We just have to change the location of the images in the json files. To do this, open the .json file and replace the current location with the location where your images are located.

Note that all this code is written in Python 2. Make the following changes if you’re using any other Python version:

  1. In model.py, change xrange in line 18 to range
  2. Change line 19 in model.py with: list(self.frontend.state_dict().items())[i][1].data[:] = list(mod.state_dict().items())[i][1].data[:]
  3. In image.py, replace ground_truth with ground-truth

Made the changes? Now, open a new terminal window and type the following commands:

cd CSRNet-pytorch
python train.py part_A_train.json part_A_val.json 0 0

Again, sit back because this will take some time. You can reduce the number of epochs in the train.py file to accelerate the process. A cool alternate option is to download the pre-trained weights from here if you don’t feel like waiting.

Finally, let’s check our model’s performance on unseen data. We will use the val.ipynb file to validate the results. Remember to change the path to the pretrained weights and images.

#defining the image path
img_paths = []
for path in path_sets:
    for img_path in glob.glob(os.path.join(path, '*.jpg')):
       img_paths.append(img_path)

 

model = CSRNet()

 

#defining the model
model = model.cuda()

 

#loading the trained weights
checkpoint = torch.load('part_A/0model_best.pth.tar')
model.load_state_dict(checkpoint['state_dict'])

Check the MAE (Mean Absolute Error) on test images to evaluate our model:

We got an MAE value of 75.69 which is pretty good. Now let’s check the predictions on a single image:

CSRNet outputWow, the original count was 382 and our model estimated there were 384 people in the image. That is a very impressive performance!

Congratulations on building your own crowd counting model!

 

End Notes

I encourage you to try out this approach on different images and share your results in the comments section below. Crowd counting has so many diverse applications and is already seeing adoption by organizations and government bodies.

It is a useful skill to add to your portfolio. Quite a number of industries will be looking for data scientists who can work with crowd counting algorithms. Learn it, experiment with it, and give yourself the gift of deep learning!

Did you find this article useful? Feel free to leave your suggestions and feedback for me below, and I’ll be happy to connect with you.

You should also check out the below resources to learn and explore the wonderful world of computer vision:

My research interests lies in the field of Machine Learning and Deep Learning. Possess an enthusiasm for learning new skills and technologies.

Responses From Readers

Clear

Saurabh Singh
Saurabh Singh

Nicely explained the complex concept of Crowd counting . And its very helpful.

Preeti
Preeti

Very nicely explained. Good one Pulkit

sandesh
sandesh

Great learning through your article .

Congratulations, You Did It!
Well Done on Completing Your Learning Journey. Stay curious and keep exploring!

We use cookies essential for this site to function well. Please click to help us improve its usefulness with additional cookies. Learn about our use of cookies in our Privacy Policy & Cookies Policy.

Show details