CONVOLUTIONAL NEURAL NETWORK (CNN)

May 25, 2021

A Convolutional Neural Network (CNN) is a feed-forward neural network that is generally used to analyze visual images by processing data with a grid-like topology. A CNN is also known as a “ConvNet”.

How Does a Computer Read an Image?

The image is broken down into 3 color channels: Red, Green, and Blue. Each channel holds one value for every pixel of the image.

Then, the computer recognizes the value associated with each pixel and determines the size of the image.

However, for black-and-white (grayscale) images there is only one channel, and the concept is the same.
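To make this concrete, here is a minimal sketch of how an image can be held as an array of pixel values. The 2x2 size, the pixel values, and the helper name `image_shape` are made up for illustration; in practice an imaging library would load the file into a similar array.

```python
# A hypothetical 2x2 RGB image: rows -> pixels -> [R, G, B] values in 0..255.
rgb_image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

# The same picture with a single channel: one value per pixel.
# Averaging R, G and B is an assumption; other weightings are common.
gray_image = [[sum(pixel) // 3 for pixel in row] for row in rgb_image]


def image_shape(image):
    """Return (height, width, channels) for a nested-list image."""
    height, width = len(image), len(image[0])
    first = image[0][0]
    channels = len(first) if isinstance(first, list) else 1
    return height, width, channels


print(image_shape(rgb_image))   # (2, 2, 3)
print(image_shape(gray_image))  # (2, 2, 1)
print(gray_image)               # [[85, 85], [85, 255]]
```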

Why not Fully Connected Networks?

Fully connected networks do not scale well to image inputs, here’s why!

Here, we have considered input images with a size of 28x28x3 pixels. If we feed this to a fully connected network, every neuron in the first hidden layer will have 28 x 28 x 3 = 2,352 weights.

But images this small are not realistic. Now, take a look at this:

A more typical input image will be at least 200x200x3 pixels in size. Every neuron in the first hidden layer now needs a whopping 200 x 200 x 3 = 120,000 weights. If this is just one neuron in the first hidden layer, imagine the number of weights needed to process an entire complex image set.

With this many weights the network overfits easily and becomes impractical to train. Hence, fully connected networks are a poor fit for images.
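The arithmetic behind these numbers can be checked with a short sketch. The function name `weights_per_neuron` and the hidden-layer size of 1,000 neurons are assumptions chosen only for illustration.

```python
def weights_per_neuron(height, width, channels):
    """Weights one fully connected neuron needs to see every input value."""
    return height * width * channels


small = weights_per_neuron(28, 28, 3)
large = weights_per_neuron(200, 200, 3)
print(small)  # 2352
print(large)  # 120000

# Hypothetical first hidden layer with 1,000 neurons (size is an assumption).
hidden_neurons = 1000
print(large * hidden_neurons)  # 120000000
```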

Why Convolutional Neural Networks?

In a CNN, a neuron in a layer is connected only to a small region of the layer before it, instead of to all of its neurons in a fully connected manner.
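A rough comparison shows why this matters. The 3x3 region size below is an assumption for illustration; the actual region size is a design choice of the network.

```python
# Fully connected: one neuron sees every value of a 200x200x3 image.
fully_connected_weights = 200 * 200 * 3

# Locally connected: one neuron sees only a small region of the image.
# The 3x3 region size is an assumption for illustration.
region_height, region_width, channels = 3, 3, 3
local_weights = region_height * region_width * channels

print(fully_connected_weights)  # 120000
print(local_weights)            # 27
```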

In CNN, every image is represented in the form of an array of pixel values.

How Do Convolutional Neural Networks Work?

The input layer accepts the pixels of the image as input in the form of arrays.

The hidden layers carry out feature extraction by performing calculations and manipulations on those arrays.

Finally, there is a fully connected layer that identifies the object in the image.

There are multiple hidden layers, such as the convolution layer, ReLU layer, and pooling layer, that perform feature extraction from the image before the fully connected layer classifies it.

Convolution Layer

This is the first step in the process of extracting valuable features from an image. This layer uses a matrix filter and performs convolution operations to detect patterns in the image.

Consider the following 5x5 image whose pixel values are only 0 and 1.

We slide the filter matrix over the image and compute the dot product at each position to get the convolved feature matrix.

Let’s see how it works:

We slide the filter matrix again and again until it has covered all the pixels of the input image. Finally, we get the filled convolved matrix.
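The sliding dot product can be sketched in a few lines of plain Python. The 5x5 image and 3x3 filter values below are illustrative assumptions, not necessarily the ones from the original figure, and the sketch assumes a stride of 1 with no padding. Note that, as in most deep learning libraries, the filter is not flipped before it is applied.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the
    convolved feature map as a list of rows."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Dot product of the kernel with the patch it currently covers.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 image whose pixel values are only 0 and 1.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]

# Hypothetical 3x3 filter.
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

for row in convolve2d(image, kernel):
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

A 5x5 image and a 3x3 filter give a 3x3 feature map, because the filter fits in 5 - 3 + 1 = 3 positions along each axis.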

Once the feature maps are extracted, the next step is to move them to the ReLU layer.

ReLU Layer

ReLU stands for rectified linear unit. The ReLU activation function is applied to the output of the convolution layer to get a rectified feature map of the image.

This layer does four things:

Performs an element-wise operation
Sets all negative values to 0
Introduces non-linearity to the network
Outputs a rectified feature map
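In code, ReLU is simply max(0, x) applied to every value. Here is a minimal sketch; the feature map values are made up so that some of them are negative.

```python
def relu(feature_map):
    """Apply ReLU element-wise: negative values become 0, the rest pass through."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical feature map containing negative values.
feature_map = [
    [3, -1, 0],
    [-5, 2, -2],
    [1, -4, 6],
]

print(relu(feature_map))  # [[3, 0, 0], [0, 2, 0], [1, 0, 6]]
```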

Below is the graph of the ReLU function:

The rectified feature map now goes through a pooling layer.

Pooling Layer

Pooling is a down-sampling operation that reduces the dimensionality of the feature map.

The pooling layer slides a window (the kernel) over each feature map and summarizes the values it covers, so the strongest responses to features such as edges and corners are kept.

The pooling layer has two common types:

Max pooling: returns the maximum value from the portion of the feature map covered by the kernel. Max pooling also acts as a noise suppressant. It discards the noisy activations altogether, so it performs de-noising along with dimensionality reduction.
Average pooling: returns the average of all the values from the portion of the feature map covered by the kernel. Average pooling performs dimensionality reduction, and the averaging itself is its only noise-suppressing mechanism.

In practice, max pooling often performs better than average pooling.

Consider the following 4x4 rectified feature map
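Both pooling types can be sketched as follows. The 4x4 values, the 2x2 window, and the stride of 2 are assumptions for illustration, and `pool2d` is a hypothetical helper name.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample `feature_map` with a size x size window."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    pooled = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values covered by the window at this position.
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 rectified feature map.
rectified = [
    [1, 4, 2, 7],
    [2, 6, 8, 5],
    [3, 4, 0, 7],
    [1, 2, 3, 1],
]

print(pool2d(rectified, mode="max"))      # [[6, 8], [4, 7]]
print(pool2d(rectified, mode="average"))  # [[3.25, 5.5], [2.5, 2.75]]
```

Either way, the 4x4 map shrinks to 2x2: each output value summarizes one 2x2 block of the input.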

The next step in the process is called flattening.

Flattening is the process of converting all the resultant 2-dimensional arrays from the pooled feature map into a single long continuous linear vector.
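Flattening is a one-liner in Python. In this sketch the two 2x2 pooled feature maps are made-up values, and `flatten` is a hypothetical helper name.

```python
def flatten(pooled_maps):
    """Concatenate every value of every pooled map into one long vector."""
    return [
        value
        for feature_map in pooled_maps
        for row in feature_map
        for value in row
    ]


# Two hypothetical 2x2 pooled feature maps.
pooled_maps = [
    [[6, 8], [4, 7]],
    [[1, 0], [2, 3]],
]

print(flatten(pooled_maps))  # [6, 8, 4, 7, 1, 0, 2, 3]
```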

Fully Connected Layer

The flattened vector from the pooling layer is fed as input to the fully connected layer to classify the image.

The last layers in the network are fully connected, meaning that every neuron in one layer is connected to every neuron in the next layer.

Here’s exactly how a CNN recognizes a bird:
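A fully connected layer is a weighted sum per output neuron. The sketch below uses made-up weights, biases, and class labels; a real network learns the weights during training. The softmax step that turns scores into probabilities is a common choice that we assume here, not something this article prescribes.

```python
import math


def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical flattened vector, weights, biases and labels.
flattened = [6, 8, 4, 7]
weights = [
    [0.1, 0.2, -0.1, 0.0],   # weights of the "cat" neuron
    [-0.2, 0.1, 0.3, 0.1],   # weights of the "bird" neuron
]
biases = [0.0, 0.5]
labels = ["cat", "bird"]

scores = fully_connected(flattened, weights, biases)
probabilities = softmax(scores)
print(labels[probabilities.index(max(probabilities))])  # bird
```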

The pixels from the image are fed to the convolutional layer that performs the convolution operation
It results in a convolved map
A ReLU function is applied to the convolved map to generate a rectified feature map
The image is processed with multiple convolution and ReLU layers to locate the features
Pooling layers then down-sample the rectified feature maps, keeping the strongest responses for specific parts of the image
The pooled feature map is flattened and fed to a fully connected layer to get the final output
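The steps above can be chained into one small forward pass. This is only a sketch under stated assumptions: the 6x6 image, the vertical-edge filter, and the fully connected weights are all made up rather than learned, and there is just one convolution, ReLU, and pooling stage.

```python
def convolve2d(image, kernel):
    """Stride-1, no-padding convolution (no kernel flip)."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(size)
        ]
        for r in range(size)
    ]


def relu(feature_map):
    """Set negative values to 0."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


def flatten(feature_map):
    """Turn a 2-dimensional map into one long vector."""
    return [v for row in feature_map for v in row]


def fully_connected(inputs, weights, biases):
    """Weighted sum of all inputs plus a bias, per output neuron."""
    return [sum(w * x for w, x in zip(ws, inputs)) + b
            for ws, b in zip(weights, biases)]


# Hypothetical 6x6 image: bright left half, dark right half.
image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]
# Hypothetical filter that responds to a bright-to-dark vertical edge.
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
# Made-up (untrained) weights for two output scores.
weights = [[0.5, 0.5, 0.5, 0.5], [-0.5, -0.5, -0.5, -0.5]]
biases = [0.0, 0.0]

convolved = convolve2d(image, kernel)   # 4x4 map, each row [0, 3, 3, 0]
rectified = relu(convolved)             # unchanged here: no negative values
pooled = max_pool(rectified)            # [[3, 3], [3, 3]]
flat = flatten(pooled)                  # [3, 3, 3, 3]
scores = fully_connected(flat, weights, biases)
print(scores)  # [6.0, -6.0]
```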