Basics of CNN in Deep Learning

Debasish Kalita Last Updated : 08 Jan, 2025

In this article, we look at the fundamental principles and components of Convolutional Neural Networks (CNNs): their architecture, their core operations, and their role in deep learning. Whether you are a novice eager to grasp the essentials or a seasoned practitioner looking to refresh your knowledge, this overview covers the basics of CNNs in deep learning.

This article was published as a part of the Data Science Blogathon

What is a Convolutional Neural Network?

Convolutional Neural Networks, also known as CNNs or ConvNets, are a type of feed-forward artificial neural network whose connectivity structure is inspired by the organization of the animal visual cortex. Small clusters of cells in the visual cortex are sensitive to certain areas of the visual field. Individual neuronal cells in the brain respond or fire only when certain orientations of edges are present. Some neurons activate when shown vertical edges, while others fire when shown horizontal or diagonal edges. In deep learning, CNNs are used mainly to analyze visual information, but they can also handle tasks involving sound, text, video, and other media. Yann LeCun and his colleagues at Bell Labs developed the first successful convolutional networks in the late 1980s and 1990s.

Convolution Neural Network | Basics of CNN

Convolutional Neural Networks (CNNs) have an input layer, an output layer, and numerous hidden layers, often with millions of parameters, allowing them to learn complicated objects and patterns. Convolution layers extract features from the input, an activation function is applied to the result, and pooling layers sub-sample it. Neurons in the hidden layers connect only to part of the previous layer, while a fully connected layer at the end produces the output. In a classifier, that output is a vector of class scores rather than an array the size of the input image.

Convolution is a mathematical operation that combines two functions to produce a third. In CNNs, the input image is convolved with filters, resulting in a feature map. A filter is a small array of weights, with an associated bias, that is typically initialized randomly and then learned during training. Instead of having individual weights and biases for each neuron, a CNN uses the same weights and bias for all neurons in a feature map. Many filters can be created, each of which captures a different aspect of the input. Kernels are another name for filters.

Convolutional Layer

In convolutional neural networks (CNNs), the primary components are convolutional layers. These layers typically involve input vectors, such as an image, filters (or feature detectors), and output vectors, often referred to as feature maps. As the input, such as an image, traverses through a convolutional layer, it undergoes abstraction into a feature map, also known as an activation map. This process involves the convolution operation, which enables the detection of more complex features within the image.

Additionally, Rectified linear units (ReLU) commonly serve as activation functions within these layers to introduce non-linearity into the network. Furthermore, CNNs often employ pooling operations to reduce the spatial dimensions of the feature maps, leading to a more manageable output volume. Overall, convolutional layers play a crucial role in extracting meaningful features from the input data, making them fundamental in tasks such as image classification and natural language processing, among others, within the realm of machine learning models.

Feature Map = Input Image ∗ Feature Detector, where ∗ denotes the convolution operation

Convolutional layers convolve the input and pass the output to the next layer. This is analogous to a neuron’s response to a single stimulus in the visual cortex. Each convolutional neuron processes data only for its assigned receptive field.

In mathematics, a convolution is an operation that combines two functions to produce a third. In CNNs, convolution combines two matrices (rectangular arrays of numbers arranged in rows and columns), the input and the kernel, to generate a third matrix, the feature map.

In the convolutional layers of a CNN, these convolutions filter input data to extract information.

Basics of CNN
Source: Cadalyst.com

Position the kernel’s center element above the source pixel. Then, replace the source pixel with a weighted sum of itself and its neighboring pixels.
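The weighted-sum step described above can be sketched in NumPy. This is a minimal illustration, not an optimized implementation: the function name `convolve2d_same`, the 4x4 test image, and the averaging kernel are all assumptions made for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d_same(image, kernel):
    """Replace each pixel with a weighted sum of itself and its neighbours.

    Assumes a 2D image and a kernel with odd height and width.
    The border is zero-padded so the output has the same shape as the input.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Window centred on the source pixel (i, j)
            window = padded[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# Hypothetical 4x4 image and a 3x3 averaging (box blur) kernel
image = np.arange(16, dtype=float).reshape(4, 4)
box_blur = np.full((3, 3), 1.0 / 9.0)
result = convolve2d_same(image, box_blur)
print(result)
```

Each entry of `result` is the average of a pixel and its eight neighbours, with missing neighbours at the border treated as zero.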

Parameter sharing and local connectivity are two principles used in CNNs. In a feature map, all neurons share weights, which defines parameter sharing. Local connection means each neuron connects only to a part of the input image, unlike a fully connected neural network where all neurons connect to every input. This reduces the number of parameters in the system and speeds up the calculation.
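To make the saving concrete, the parameter counts of the two layer types can be compared with a little arithmetic. The sizes below (a 28x28 grayscale input, 32 units or filters, 3x3 kernels) are illustrative assumptions, and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Fully connected layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Convolutional layer: each filter shares one small set of weights and one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

dense = dense_params(28 * 28, 32)
conv = conv_params(3, 3, 1, 32)
print(dense)  # 25120
print(conv)   # 320
```

The convolutional layer needs far fewer parameters because its count depends on the kernel size, not on the size of the image.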

Padding and Stride

Padding and stride have an impact on how the convolution procedure is carried out. They can be used to increase or decrease the dimensions (height and width) of input/output vectors.

Padding describes how many pixels are added around the border of an image before the kernel processes it. If you set the padding to zero, no pixels are added. If you set the zero padding to one, a one-pixel border with a value of zero will surround the image.
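As a small illustration of zero padding of one, the sketch below uses NumPy's `np.pad` on a hypothetical 3x3 image. The array values are arbitrary.

```python
import numpy as np

image = np.arange(1, 10).reshape(3, 3)

# Add a one-pixel border of zeros on every side
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(image.shape)   # (3, 3)
print(padded.shape)  # (5, 5)
print(padded)
```

The original pixels sit unchanged in the middle of the padded array, surrounded by zeros.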

Padding and Stride
Source: vitalflux.com

Padding works by increasing the region that the kernel can process. The kernel is a filter that moves across an image, scanning each pixel and producing an output that may be smaller than, equal to, or larger than the input. Adding padding to the image frame gives the kernel more room to cover the image, including the pixels at its border. Adding padding to a CNN-processed image therefore allows for more accurate image analysis.

convolutional neural network
Source: Computer.org

Stride determines how the filter moves over the input matrix, i.e. how many pixels it shifts at each step. When you set the stride to 1, the filter moves one pixel at a time, and when you set the stride to 2, the filter moves two pixels at a time. The larger the stride value, the smaller the output, and vice versa.
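The combined effect of padding and stride on the output size follows the standard formula floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. The helper below is a hypothetical name used only for this sketch, and the 28-pixel input with a 3-pixel kernel is an assumed example.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one dimension for input size n, kernel size k,
    padding p on each side, and stride s."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(28, 3, p=0, s=1))  # 26
print(conv_output_size(28, 3, p=1, s=1))  # 28
print(conv_output_size(28, 3, p=0, s=2))  # 13
print(conv_output_size(28, 3, p=1, s=2))  # 14
```

Padding of one keeps a 3x3 convolution from shrinking the image, while a stride of 2 roughly halves each dimension.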

Pooling

The purpose of pooling is to gradually shrink the representation’s spatial size to reduce the amount of computation and the number of parameters in later layers of the network. The pooling layer treats each feature map separately.

convolutional neural network| Basics of CNN
Source: Springer.com

The following are some methods for pooling:

  • Max-pooling: It chooses the largest element from each region of the feature map. The resulting max-pooled layer keeps the feature map’s most prominent features. It is the most commonly used method and often works well in practice.
  • Average pooling: It entails calculating the average for each region of the feature map.

By shrinking the representation in this way, pooling also helps to limit overfitting. Without pooling, and with a stride of 1 and padding that preserves size, the output of a convolutional layer has the same resolution as its input.
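The two pooling methods can be sketched in NumPy as follows. This is a minimal version that assumes non-overlapping square windows on a single 2D feature map. The function name `pool2d` and the example values are made up for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Pool a 2D feature map over non-overlapping size x size windows."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Crop to a multiple of the window size, then split into blocks
    cropped = feature_map[:h_out * size, :w_out * size]
    blocks = cropped.reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]], dtype=float)

print(pool2d(feature_map, mode="max"))      # [[6. 4.] [8. 9.]]
print(pool2d(feature_map, mode="average"))  # [[3.75 2.25] [4.5 4.]]
```

Both methods turn the 4x4 feature map into a 2x2 one. Max-pooling keeps the strongest response in each window, while average pooling smooths it.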

ReLU

The rectified linear activation function, or ReLU for short, is a piecewise linear function that outputs the input directly if the input is positive and outputs zero otherwise. Because models that use it are often quicker to train and often perform better, it has become the default activation function for many types of neural networks.
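In NumPy, this definition is a one-liner. The input values below are arbitrary examples.

```python
import numpy as np

def relu(x):
    """Return x where x is positive and 0 elsewhere."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # [0.  0.  0.  1.5 3. ]
```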

ReLu
Source: Superdatascience.com

Fully Connected Layer

At the end of a CNN, there is a fully connected layer of neurons. As in conventional neural networks, neurons in a fully connected layer have full connections to all activations in the previous layer and work similarly. After training, the fully connected layer turns the feature vector into the scores used to classify images into distinct categories. Every activation unit in the next layer connects to all inputs from this layer. Because a large share of the network’s parameters sits in the fully connected layer, it is prone to overfitting. Various strategies, including dropout, can reduce overfitting.

Softmax is an activation function that is typically applied to the network’s last layer, which serves as a classifier. This layer is responsible for categorizing the provided input into distinct classes. The softmax function maps a network’s non-normalized output to a probability distribution.
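A minimal NumPy sketch of softmax for a single vector of scores is shown below. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result. The three scores are arbitrary example values.

```python
import numpy as np

def softmax(logits):
    """Map a 1D vector of non-normalized scores to a probability distribution."""
    shifted = logits - np.max(logits)  # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)
print(probs.sum())
```

The outputs are all positive, sum to 1, and preserve the ranking of the input scores, so the largest score receives the highest probability.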

Basic Python Implementation

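Importing Some Relevant Libraries

import numpy as np
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

# Fix the random seed so that runs are repeatable
tf.random.set_seed(2019)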

Loading the MNIST Dataset

(X_train,Y_train),(X_test,Y_test) = keras.datasets.mnist.load_data()

Scaling our Data

X_train = X_train / 255
X_test = X_test / 255


# Flatten each 28x28 image into a 784-element vector
X_train_flattened = X_train.reshape(len(X_train), 28*28)
X_test_flattened = X_test.reshape(len(X_test), 28*28)

Designing Neural Network

# A single dense layer with 10 outputs, one per digit class
model = keras.Sequential([
    keras.layers.Dense(10, input_shape=(784,), activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train for 5 passes over the training set
model.fit(X_train_flattened, Y_train, epochs=5)

Output

Epoch 1/5
1875/1875 [==============================] - 8s 4ms/step - loss: 0.7187 - accuracy: 0.8141
Epoch 2/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.3122 - accuracy: 0.9128
Epoch 3/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.2908 - accuracy: 0.9187
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.2783 - accuracy: 0.9229
Epoch 5/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.2643 - accuracy: 0.9262

How Do Convolutional Layers Work?

Sliding Filters: Imagine a small window sliding over an image. This window has some numbers in it called weights. As it moves, it multiplies these weights with the numbers in the image underneath, and adds them up to make a new number. This is how convolution layers extract features.

Finding Patterns: By adjusting these weights, the window learns to recognize patterns like edges or textures. For example, it might learn to detect a horizontal line or a diagonal edge.

Sharing Knowledge: Instead of having different windows all over the image, we use the same window everywhere. This saves a lot of memory and helps the network learn faster. Convolutional neural networks rely on this technique.

Building a Picture: As we slide these windows over the image, we build up a new picture. Each new picture highlights different patterns that we’ve learned. This process is crucial for image recognition and computer vision tasks.

Making Things Smaller: Sometimes, we don’t need all the details. So, we shrink the picture by combining nearby numbers. This makes things faster and helps us focus on the most important parts. This is particularly useful in medical image analysis.

Adding Some Curves: After all these operations, we apply a simple rule to make our picture more expressive. This helps us capture complicated relationships between the patterns we’ve found. This step is common in convolutional neural networks and other deep learning models.

By repeating these steps with different patterns and pictures, we can teach a computer to recognize all sorts of things in images, like cats, cars, or even emotions on people’s faces! This involves earlier layers learning basic features and later layers combining them to recognize entire images.
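The steps above can be put together in a short NumPy sketch: slide a filter, apply ReLU, then shrink the result with max-pooling. Everything here is an illustrative assumption, including the 6x6 test image and the hand-picked vertical-edge filter. In a trained CNN the filter weights would be learned rather than written by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 6x6 image: dark left half (0), bright right half (1)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked filter that responds to dark-to-bright vertical edges
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)   # sliding filter, 4x4 output
activated = np.maximum(0, feature_map)      # ReLU non-linearity
pooled = activated.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 2x2 max-pooling

print(feature_map[0])  # [0. 3. 3. 0.]
print(pooled)          # [[3. 3.] [3. 3.]]
```

The feature map is zero over the flat regions and large where the window straddles the edge, which is the sense in which a filter "finds" a pattern.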

Conclusion

The goal of this article was to provide an overview of convolutional neural networks and their main applications. These networks, in general, produce excellent classification and recognition results. They are also used to process audio, text, and video. If the task at hand is to find patterns in spatial or sequential data, CNNs are a strong choice.


Frequently Asked Questions

Q1. What are the basics of CNN?

A. Convolutional Neural Networks (CNNs) are a class of deep learning models designed for image processing. They employ convolutional layers to automatically learn hierarchical features from input images.

Q2. What is the basic principle of CNN?

A. The basic principle of CNN lies in feature learning through convolutional layers. These layers apply filters to input data, extracting meaningful features and capturing spatial hierarchies for accurate pattern recognition.

Q3. What are the 4 components of CNN?

A. The four key components of CNN are convolutional layers, pooling layers, fully connected layers, and activation functions. These elements work together to enable feature extraction, dimension reduction, and classification in image data.

Q4. What are the basic operations of CNN?

A. CNN operations include convolution, where filters detect features, pooling to downsample and retain essential information, flattening to convert data for fully connected layers, and activation functions for introducing non-linearity in the model’s learning process.

The media shown in this article does not belong to Analytics Vidhya and the Author uses it at their discretion.

A graduate in Computer Science and Engineering from Tezpur Central University. Currently, I am pursuing my M.Tech in Computer Science and Engineering in the Department of CSE at NIT Durgapur. I expect to Postgraduate in the spring, 2022. A Grounded and Solution-oriented Computer Engineer with a wide variety of experiences. Adept at motivating self and others. Passionate about programming and educating the next generation of technology users and innovators.
