In the realm of computer vision, Convolutional Neural Networks (CNNs) have redefined the landscape of image analysis and understanding. These powerful networks have enabled breakthroughs in tasks such as image classification, object detection, and semantic segmentation. They have laid the foundation for a wide range of applications in fields like healthcare, autonomous vehicles, and more.
However, as the demand for more context-aware and robust models continues to grow, traditional convolutional layers within CNNs have faced limitations in capturing extensive contextual information. This has led to the need for innovative techniques that can enhance the network’s ability to understand broader contexts without significantly increasing computational complexity.
Enter Atrous Convolution, a groundbreaking approach that has disrupted the conventional norms of convolutional layers within CNNs. Atrous Convolution, also known as dilated convolution, introduces a new dimension to the world of deep learning by enabling networks to capture broader context without significantly increasing computational cost or parameters.
Convolutional Neural Networks (CNNs) are a class of deep neural networks designed primarily for analyzing visual data such as images and videos. Inspired by the human visual system, they are exceptionally effective at pattern recognition in visual data: convolutional layers learn local features, pooling layers summarize and downsample them, and the final layers turn these features into predictions.
Atrous convolution, also known as dilated convolution, is a type of convolutional operation that introduces a parameter called the dilation rate. Unlike regular convolution, which applies filters to adjacent pixels, atrous convolution spaces out the filter parameters by introducing gaps between them, controlled by the dilation rate. This process enlarges the receptive field of the filters without increasing the number of parameters. In simpler terms, it allows the network to capture a broader context from the input data without adding more complexity.
The dilation rate determines the spacing between the kernel elements, i.e., how many input pixels are skipped between the positions the filter samples. A rate of 1 corresponds to regular convolution, while higher rates skip more pixels. The resulting enlarged receptive field lets the network capture larger contextual information without increasing the computational cost, so it can model both local details and global context efficiently.
In essence, atrous convolution facilitates the integration of wider context information into convolutional neural networks, enabling better modeling of large-scale patterns within the data. It’s commonly used in applications where context at varying scales is crucial, such as semantic segmentation in computer vision or handling sequences in natural language processing tasks.
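To make this concrete, here is a minimal Keras sketch (the input shape and filter count are illustrative assumptions): a 3×3 convolution with dilation rate 2 covers a 5×5 neighbourhood, following the effective kernel size k + (k − 1)(r − 1), yet it has exactly the same number of weights as the regular 3×3 convolution.
Code:
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical 32x32 single-channel input
inputs = layers.Input(shape=(32, 32, 1))

# Regular 3x3 convolution: samples a contiguous 3x3 neighbourhood
regular = layers.Conv2D(16, 3, padding='same', dilation_rate=1)(inputs)

# Atrous 3x3 convolution with dilation rate 2: the effective kernel size is
# k + (k - 1) * (r - 1) = 3 + 2 * 1 = 5, so it covers a 5x5 neighbourhood
# while using the same nine weights per filter (no extra parameters)
atrous = layers.Conv2D(16, 3, padding='same', dilation_rate=2)(inputs)

model = tf.keras.Model(inputs, [regular, atrous])
model.summary()  # both Conv2D layers report 160 parameters (3*3*1*16 weights + 16 biases)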
Dilated convolutions, also known as atrous convolutions, have been pivotal in multi-scale feature learning within neural networks: by varying the dilation rate, the same kernel can aggregate context at several scales without downsampling. The comparison below contrasts regular and atrous convolution, and a minimal multi-scale sketch follows it.
Input Image
|
Regular Convolution
- Kernel size: fixed kernel
- Sliding strategy: slides over contiguous positions of the input feature map
- Stride: usually 1
- Output feature map: reduced in size (unless padding is used)
|
Atrous (Dilated) Convolution
- Kernel size: fixed kernel with gaps between elements (controlled by the dilation rate)
- Sliding strategy: spaced kernel elements enlarge the receptive field
- Spacing: the dilation rate sets the gap between the sampled input positions
- Output feature map: input size preserved, receptive field expanded
| Aspect | Regular Convolution | Atrous (Dilated) Convolution |
|---|---|---|
| Filter Application | Applies filters to contiguous regions of the input data | Introduces gaps (holes) between filter elements |
| Kernel Size | Fixed kernel size | Fixed kernel size, but with gaps controlled by the dilation rate |
| Sliding Strategy | Slides across the input feature map position by position | Spaced kernel elements allow for an enlarged receptive field |
| Stride | Usually a stride of 1 | Usually a stride of 1; the dilation rate spaces the sampled input positions within the kernel |
| Output Feature Map Size | Reduced in size by the convolution (unless padding is used) | Preserves the input size (with 'same' padding) while increasing the receptive field |
| Receptive Field | Limited effective receptive field | Expanded effective receptive field |
| Context Information Capture | Limited context capture | Enhanced capability to capture broader context |
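As a minimal illustration of the multi-scale idea summarized above (the input shape, filter counts, and dilation rates are illustrative assumptions), the sketch below applies the same 3×3 kernel at three dilation rates to one feature map and concatenates the results, so the network sees fine and coarse context side by side at the original resolution.
Code:
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical input feature map: 64x64 with 32 channels
inputs = layers.Input(shape=(64, 64, 32))

# The same 3x3 kernel at three dilation rates: local, medium, and wide context
branch_rate1 = layers.Conv2D(32, 3, padding='same', dilation_rate=1, activation='relu')(inputs)
branch_rate2 = layers.Conv2D(32, 3, padding='same', dilation_rate=2, activation='relu')(inputs)
branch_rate4 = layers.Conv2D(32, 3, padding='same', dilation_rate=4, activation='relu')(inputs)

# Concatenating the branches yields multi-scale features at the original resolution
multi_scale = layers.Concatenate()([branch_rate1, branch_rate2, branch_rate4])

model = tf.keras.Model(inputs, multi_scale)
print(model.output_shape)  # (None, 64, 64, 96)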
DeepLab is a series of convolutional neural network architectures created for semantic image segmentation. It is recognized for using atrous convolutions (also known as dilated convolutions) and atrous spatial pyramid pooling (ASPP) to capture multi-scale contextual information in images, allowing for precise pixel-level segmentation.
Below is a simplified Keras sketch in the spirit of DeepLab: a small encoder, a pair of atrous convolutions standing in for the full ASPP module, and a decoder that restores the input resolution.
Code:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Conv2DTranspose
def create_DeepLab_model(input_shape, num_classes):
    # Simplified DeepLab-style sketch: a small encoder, a pair of atrous
    # convolutions in place of the full ASPP module, and a decoder that
    # restores the input resolution
    model = Sequential([
        Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=input_shape),
        Conv2D(64, (3, 3), activation='relu', padding='same'),
        MaxPooling2D(pool_size=(2, 2)),
        Conv2D(128, (3, 3), activation='relu', padding='same'),
        Conv2D(128, (3, 3), activation='relu', padding='same'),
        MaxPooling2D(pool_size=(2, 2)),
        # Atrous (dilated) convolutions enlarge the receptive field
        # without any further downsampling
        Conv2D(256, (3, 3), dilation_rate=2, activation='relu', padding='same'),
        Conv2D(256, (3, 3), dilation_rate=4, activation='relu', padding='same'),
        # Decoder: two transposed convolutions undo the two pooling steps
        Conv2DTranspose(128, (3, 3), strides=(2, 2), padding='same', activation='relu'),
        Conv2DTranspose(64, (3, 3), strides=(2, 2), padding='same', activation='relu'),
        # Per-pixel class probabilities
        Conv2D(num_classes, (1, 1), activation='softmax', padding='same')
    ])
    return model
# Define input shape and number of classes
input_shape = (256, 256, 3) # Example input shape
num_classes = 21 # Example number of classes
# Create the DeepLab model
deeplab_model = create_DeepLab_model(input_shape, num_classes)
# Compile the model (you might want to adjust the optimizer and loss function based on your task)
deeplab_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Print model summary
deeplab_model.summary()
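The sketch above uses plain atrous convolutions in place of DeepLab's defining Atrous Spatial Pyramid Pooling (ASPP). For completeness, here is a hedged sketch of an ASPP-style block (the function name aspp_block, the 32×32×512 feature map, and the rates 6, 12, and 18 are assumptions loosely following DeepLabv3): parallel atrous convolutions at several rates plus an image-level pooling branch are concatenated and projected with a 1×1 convolution.
Code:
import tensorflow as tf
from tensorflow.keras import layers

def aspp_block(x, filters=256, rates=(6, 12, 18)):
    # 1x1 convolution branch
    branches = [layers.Conv2D(filters, 1, padding='same', activation='relu')(x)]
    # Parallel atrous convolutions at several dilation rates
    for rate in rates:
        branches.append(layers.Conv2D(filters, 3, padding='same',
                                      dilation_rate=rate, activation='relu')(x))
    # Image-level pooling branch: global context broadcast back to the feature map size
    height, width, channels = x.shape[1], x.shape[2], x.shape[3]
    pooled = layers.GlobalAveragePooling2D()(x)
    pooled = layers.Reshape((1, 1, channels))(pooled)
    pooled = layers.Conv2D(filters, 1, padding='same', activation='relu')(pooled)
    pooled = layers.UpSampling2D(size=(height, width), interpolation='bilinear')(pooled)
    branches.append(pooled)
    # Fuse all branches and project back down with a 1x1 convolution
    fused = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding='same', activation='relu')(fused)

# Example: apply the ASPP-style block to a hypothetical 32x32x512 backbone feature map
feature_map = layers.Input(shape=(32, 32, 512))
aspp_model = tf.keras.Model(feature_map, aspp_block(feature_map))
print(aspp_model.output_shape)  # (None, 32, 32, 256)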
Atrous convolutions also slot naturally into Fully Convolutional Networks (FCNs): stacking dilated layers in the encoder widens the receptive field before the decoder upsamples back to the input resolution.
Code:
import tensorflow as tf
# Define the atrous convolution layer function
def atrous_conv_layer(inputs, filters, kernel_size, rate):
return tf.keras.layers.Conv2D(filters=filters, kernel_size=kernel_size,
dilation_rate=rate, padding='same', activation='relu')(inputs)
# Example FCN architecture with atrous convolutions
def FCN_with_AtrousConv(input_shape, num_classes):
inputs = tf.keras.layers.Input(shape=input_shape)
    # Encoder (VGG-style) with one downsampling step
    conv1 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    conv2 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(conv1)
    pool1 = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(conv2)
    # Atrous convolution layers widen the receptive field without further downsampling
    atrous_conv1 = atrous_conv_layer(pool1, 128, (3, 3), rate=2)
    atrous_conv2 = atrous_conv_layer(atrous_conv1, 128, (3, 3), rate=4)
    # Add more atrous convolutions as needed...
    # Decoder (transposed convolution) restores the input resolution
    upsample = tf.keras.layers.Conv2DTranspose(64, (3, 3), strides=(2, 2),
                                               padding='same')(atrous_conv2)
    output = tf.keras.layers.Conv2D(num_classes, (1, 1), activation='softmax')(upsample)
model = tf.keras.models.Model(inputs=inputs, outputs=output)
return model
# Define input shape and number of classes
input_shape = (256, 256, 3) # Example input shape
num_classes = 10 # Example number of classes
# Create an instance of the FCN with AtrousConv model
model = FCN_with_AtrousConv(input_shape, num_classes)
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Display model summary
model.summary()
LinkNet is an efficient image segmentation architecture that pairs a lightweight encoder-decoder design with atrous (dilated) convolutions and leverages skip connections between encoder and decoder to improve information flow and segmentation accuracy. The sketch below is a simplified LinkNet-style model.
Code:
import torch
import torch.nn as nn
import torch.nn.functional as F
class ConvBlock(nn.Module):
def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
super(ConvBlock, self).__init__()
self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size,
stride=stride, padding=padding)
self.bn = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
def forward(self, x):
x = self.conv(x)
x = self.bn(x)
x = self.relu(x)
return x
class DecoderBlock(nn.Module):
def __init__(self, in_channels, out_channels):
super(DecoderBlock, self).__init__()
self.conv1 = ConvBlock(in_channels, in_channels // 4, kernel_size=1, stride=1, padding=0)
self.deconv = nn.ConvTranspose2d(in_channels // 4, out_channels, kernel_size=4,
stride=2, padding=1)
self.conv2 = ConvBlock(out_channels, out_channels)
    def forward(self, x, skip):
        x = self.conv1(x)    # 1x1 convolution reduces the channel count
        x = self.deconv(x)   # transposed convolution upsamples by a factor of 2
        x = self.conv2(x)
        if skip is not None:
            x = x + skip     # add the encoder feature map of matching shape
        return x
class LinkNet(nn.Module):
def __init__(self, num_classes=21):
super(LinkNet, self).__init__()
# Encoder
self.encoder = nn.Sequential(
ConvBlock(3, 64),
nn.MaxPool2d(2),
ConvBlock(64, 128),
nn.MaxPool2d(2),
ConvBlock(128, 256),
nn.MaxPool2d(2),
ConvBlock(256, 512),
nn.MaxPool2d(2)
)
# Decoder
self.decoder = nn.Sequential(
DecoderBlock(512, 256),
DecoderBlock(256, 128),
DecoderBlock(128, 64),
DecoderBlock(64, 32)
)
# Final prediction
self.final_conv = nn.Conv2d(32, num_classes, kernel_size=1)
    def forward(self, x):
        # Keep the feature map produced after each pooling stage for the skip connections
        skips = []
        for module in self.encoder:
            x = module(x)
            if isinstance(module, nn.MaxPool2d):
                skips.append(x)
        skips = skips[::-1]  # deepest feature map first
        # skips[0] is the encoder output itself; skips[i + 1] matches decoder stage i
        for i, module in enumerate(self.decoder):
            skip = skips[i + 1] if i + 1 < len(skips) else None
            x = module(x, skip)
        x = self.final_conv(x)
        return x
# Example usage:
input_tensor = torch.randn(1, 3, 224, 224) # Example input tensor shape
model = LinkNet(num_classes=10) # Example number of classes
output = model(input_tensor)
print(output.shape)  # torch.Size([1, 10, 224, 224])
This method adapts Fully Convolutional Networks (FCNs), which are highly effective for semantic segmentation, for instance-aware semantic segmentation. Unlike the original FCN, where each output pixel is a classifier of an object category, in InstanceFCN, each output pixel is a classifier of the relative positions of instances. For example, in the score map, each pixel is a classifier of whether it belongs to the “right side” of an instance or not.
An FCN is applied to the input image to generate k² score maps, each corresponding to a particular relative position; these are called instance-sensitive score maps. To produce object instances from these score maps, a sliding window of size m×m is used. The m×m window is divided into k² sub-windows of size (m/k)×(m/k), one for each relative position. Each (m/k)×(m/k) sub-window of the output directly copies the values from the same sub-window in the corresponding score map, and the k² sub-windows are put together according to their relative positions to assemble an m×m segmentation output. For example, the #1 (top-left) sub-window of the output is taken directly from the top-left (m/k)×(m/k) sub-window of the m×m window in the #1 instance-sensitive score map. This is called the instance assembling module.
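To illustrate the assembling module, here is a small NumPy sketch (the function name assemble_window and all shapes are illustrative assumptions): each (m/k)×(m/k) sub-window of the m×m output is copied from the same spatial sub-window of the score map for the corresponding relative position.
Code:
import numpy as np

def assemble_window(score_maps, y, x, m, k):
    # score_maps: array of shape (k*k, H, W) with one map per relative position
    sub = m // k                              # each sub-window is (m/k) x (m/k)
    output = np.zeros((m, m), dtype=score_maps.dtype)
    for i in range(k):                        # row index of the relative position
        for j in range(k):                    # column index of the relative position
            idx = i * k + j                   # which instance-sensitive map to copy from
            ys, xs = y + i * sub, x + j * sub
            output[i * sub:(i + 1) * sub, j * sub:(j + 1) * sub] = \
                score_maps[idx, ys:ys + sub, xs:xs + sub]
    return output

# Example: k = 3 relative positions per axis and an m = 21 sliding window
# give nine 7x7 sub-windows, each copied from a different score map
maps = np.random.rand(9, 64, 64).astype(np.float32)
mask = assemble_window(maps, y=10, x=10, m=21, k=3)
print(mask.shape)  # (21, 21)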
The architecture consists of applying VGG-16 fully convolutionally on the input image. On the output feature map, there are two fully convolutional branches. One of them is for estimating segment instances (as described above) and the other is for scoring the instances.
Atrous convolutions, which introduce gaps in the filter, are used in parts of this architecture to expand the network’s field of view and capture more context information.
For the first branch, a 1×1, 512-d conv. layer followed by a 3×3 conv. layer generates the set of k² instance-sensitive score maps, and the assembling module described earlier predicts the m×m (m = 21) segmentation mask. The second branch consists of a 3×3, 512-d conv. layer followed by a 1×1 conv. layer. This 1×1 conv. layer performs per-pixel logistic regression, classifying whether the m×m sliding window centered at that pixel contains an instance or not. The output of this branch is therefore an objectness score map in which each score corresponds to one sliding window that generates one instance; as a result, the method is blind to the different object categories.
Code:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, concatenate
# Define your atrous convolution layer
def atrous_conv_layer(input_layer, filters, kernel_size, dilation_rate):
return Conv2D(filters=filters, kernel_size=kernel_size,
dilation_rate=dilation_rate, padding='same', activation='relu')(input_layer)
# Define your InstanceFCN model (simplified sketch; both heads below are
# illustrative stand-ins for the paper's instance-sensitive and objectness branches)
def InstanceFCN(input_shape, num_classes=21, num_instances=9):
    inputs = Input(shape=input_shape)
    # VGG-16-style fully convolutional layers
    conv1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    conv2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv1)
    # Atrous convolution layer widens the receptive field without downsampling
    atrous_conv = atrous_conv_layer(conv2, filters=128, kernel_size=(3, 3),
                                    dilation_rate=(2, 2))
    # More convolutional layers and branches can be added here as needed
    # Output layer for scoring
    score_output = Conv2D(num_classes, (1, 1), activation='softmax')(atrous_conv)
    # Output layer for instance estimation (one channel per relative position)
    instance_output = Conv2D(num_instances, (1, 1), activation='sigmoid')(atrous_conv)
    return Model(inputs=inputs, outputs=[score_output, instance_output])
# Usage:
model = InstanceFCN(input_shape=(256, 256, 3)) # Example input shape
model.summary() # View the model summary
Fully Convolutional Instance-aware Semantic Segmentation (FCIS) builds on the InstanceFCN method. InstanceFCN can only predict a fixed m×m mask and cannot classify the object into different categories. FCIS addresses both limitations: it produces masks for region proposals of varying size and also predicts the object category.
Given a RoI, the pixel-wise score maps are produced by the assembling operation described above for InstanceFCN. For each pixel in the RoI there are two tasks, so two sets of score maps ("inside" and "outside") are produced: detection, deciding whether the pixel belongs to an object bounding box at its relative position, and segmentation, deciding whether the pixel lies inside the object instance's boundary.
Based on these two scores, three cases arise: (1) high inside score and low outside score: detection+, segmentation+; (2) low inside score and high outside score: detection+, segmentation-; (3) both scores low: detection-, segmentation-.
For detection, the max operation is used to differentiate cases 1 and 2 (detection+) from case 3 (detection-). The detection score of the whole ROI is obtained via average pooling over all pixels’ likelihoods followed by the softmax operator across all the categories. For segmentation, softmax is used to differentiate case 1 (segmentation+) from the rest (segmentation-). The foreground mask of the ROI is the union of the per-pixel segmentation scores for each category.
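A tiny NumPy sketch of this score fusion (the inside/outside arrays are random placeholders standing in for the assembled per-RoI score maps of one category): detection takes the per-pixel max of the two scores and averages it over the RoI, while segmentation applies a per-pixel softmax between the two scores to obtain the foreground probability.
Code:
import numpy as np

# Placeholder "inside" and "outside" score maps for one 21x21 RoI and one category
inside = np.random.randn(21, 21).astype(np.float32)
outside = np.random.randn(21, 21).astype(np.float32)

# Detection: per-pixel max of the two scores, average-pooled over the RoI
detection_score = np.maximum(inside, outside).mean()

# Segmentation: per-pixel softmax between inside and outside;
# the "inside" probability forms the foreground mask
exp_in, exp_out = np.exp(inside), np.exp(outside)
foreground_mask = exp_in / (exp_in + exp_out)

print(detection_score, foreground_mask.shape)  # scalar score, (21, 21)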
ResNet is used to extract features from the input image fully convolutionally. An RPN is added on top of the conv4 layer to generate the RoIs. From the conv5 feature map, 2k² × (C + 1) score maps are produced (C object categories, one background category, and two sets of k² score maps per category) using a 1×1 conv. layer. The RoIs (after non-maximum suppression) are classified as the categories with the highest classification scores. To obtain the foreground mask of a RoI, all RoIs with an intersection-over-union higher than 0.5 with it are taken; their masks for the predicted category are averaged per pixel, weighted by their classification scores, and the averaged mask is then binarized.
Atrous Convolutions have transformed semantic segmentation by addressing the challenge of capturing contextual information without sacrificing computational efficiency. These dilated convolutions are designed to expand receptive fields while maintaining spatial resolution. They have become essential components of modern architectures such as DeepLab, LinkNet, and others.
The capability of Atrous Convolutions to capture multi-scale features and improve contextual understanding has led to their widespread adoption in cutting-edge segmentation models. As research progresses, the integration of Atrous Convolutions with other techniques holds the promise of further advancements in achieving precise, efficient, and contextually rich semantic segmentation across diverse domains.
Q. Why are Atrous Convolutions useful?
A. Atrous Convolutions allow exploring different scales within an image without compromising its details, enabling more comprehensive feature extraction.
Q. How do Atrous Convolutions differ from regular convolutions?
A. Unlike regular convolutions, Atrous Convolutions introduce gaps between the filter elements, effectively increasing the receptive field without downsampling.
Q. Where are Atrous Convolutions commonly used?
A. Atrous Convolutions are prevalent in semantic segmentation, image classification, and object detection tasks because of their ability to preserve image details.
Q. Are Atrous Convolutions computationally efficient?
A. Yes. Atrous Convolutions help maintain computational efficiency by retaining the resolution of the feature maps, allowing for larger receptive fields without significantly increasing the number of parameters.
Q. Are Atrous Convolutions limited to a specific architecture?
A. No. Atrous Convolutions can be integrated into various architectures such as DeepLab, LinkNet, and others, showcasing their versatility across different frameworks.