Deep Learning Model--CNN


GREAT HONOR TO SHARE MY KNOWLEDGE WITH YOU

Fully Connected Layers

A fully connected layer (sometimes called a dense layer) is made up of $n$ neurons, each of which receives all the output values coming from the previous layer (such as the hidden layer in a Multi-layer Perceptron (MLP)). It can be characterized by a weight matrix, a bias vector, and an activation function: $\bar{y} = f(W\bar{x} + \bar{b})$
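As a quick illustration, here is a minimal NumPy sketch of this forward pass (the shapes and the ReLU activation are illustrative assumptions):

```python
import numpy as np

# A minimal sketch of a fully connected layer: y = f(W x + b).
def dense_forward(x, W, b, f):
    return f(W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weight matrix: one row per neuron
b = rng.normal(size=3)        # bias vector
x = rng.normal(size=4)        # output of the previous layer

relu = lambda z: np.maximum(z, 0.0)
y = dense_forward(x, W, b, relu)
print(y.shape)  # (3,): one output per neuron
```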

It’s important to remember that an MLP must contain non-linear activations (for example, sigmoid, tanh, or ReLU). In fact, a network with $n$ linear hidden layers is equivalent to a standard perceptron. In complex architectures, fully connected layers are normally used as intermediate or output layers, in particular when it’s necessary to represent a probability distribution. For example, a deep architecture could be employed for image classification with $m$ output classes. In this case, the softmax activation function allows having an output vector where each element is the probability of a class (and the sum of all outputs is always normalized to 1.0). In this case, the argument is considered as a logit, or the logarithm of odds: $\mathrm{logit}_i(p) = \log\left(\frac{p}{1-p}\right) = W_i\bar{x} + b_i$
where $W_i$ is the $i$-th row of $W$. The probability of class $y_i$ is obtained by applying the softmax function to each logit: $p(y_i) = \frac{e^{\mathrm{logit}_i(p)}}{\sum_{j} e^{\mathrm{logit}_j(p)}}$
This type of output can easily be trained using a cross-entropy loss function.
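For concreteness, here is a minimal sketch of the softmax and cross-entropy computations in NumPy (the logit values are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; outputs sum to 1.0.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, true_class):
    # Negative log-probability of the correct class.
    return -np.log(p[true_class])

logits = np.array([2.0, 1.0, 0.1])   # W_i x + b_i for each class
p = softmax(logits)
print(p, p.sum())                     # class probabilities, sum is 1.0
print(cross_entropy(p, 0))            # loss if class 0 is the true label
```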

Image Kernel and Convolution Operation

As we all know, an image can be denoted as a matrix of pixels. For example, using one matrix with the same size as the image, we can describe a grayscale image; three matrices of the same size describe a three-channel image (also known as a color image); and four matrices describe a four-channel image (with an alpha channel).
And what is an image kernel? An image kernel is a matrix of small size (often 3×3, 5×5, or 9×9), and in some cases we can also see an image kernel as a moving window.
What is the convolution operation? Convolution is a mathematical calculation between two matrices: each element of one matrix is multiplied by the corresponding element of the other matrix, and the result is derived by adding up all the sub-results of the multiplications. The process of the convolution operation is shown below:

[Figure: the convolution operation between two matrices]
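In NumPy, this operation is a one-liner; the two matrices below are made up for illustration:

```python
import numpy as np

# Convolution of two equally sized matrices: multiply element-wise, then sum.
I = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)
K = np.array([[0, 1, 0],
              [1, -4, 1],
              [0, 1, 0]], dtype=float)

result = np.sum(I * K)   # element-wise products, added together
print(result)
```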

Applying Image Kernel on an Image

An image kernel is a small matrix used to apply effects like the ones you might find in Photoshop or Gimp, such as blurring, sharpening, outlining, or embossing. Kernels are also used in machine learning for feature extraction, a technique for determining the most important portions of an image. Let’s walk through applying the following 3×3 blur kernel to the image of a face. We first define a kernel (obtained from the Gauss function):
[Figure: the 3×3 Gaussian blur kernel $K$]
Then, for each pixel in the image below except those at the edges, we find its neighboring pixels and build a 3×3 matrix, denoted $I$, with the same size as the kernel. We then apply the convolution operation to $I$ and $K$ and replace the original pixel with the result of the convolution. This produces the new image shown as follows:
[Figure: the face image after applying the blur kernel]
As we can see, the output image is quite smooth; that is to say, the difference between neighboring pixels is smaller than in the input image. The effect of applying an image kernel varies with the kernel, and different kernels in turn extract different features of the image.
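To make the procedure concrete, here is a sketch of the whole pass in NumPy, assuming a grayscale image stored as a 2D array; the 1/16-weighted kernel is a common discrete approximation of the Gaussian:

```python
import numpy as np

def convolve_gray(image, kernel):
    """Apply a kernel to every interior pixel; edge pixels are left unchanged."""
    k = kernel.shape[0] // 2
    out = image.copy().astype(float)
    for r in range(k, image.shape[0] - k):
        for c in range(k, image.shape[1] - k):
            window = image[r - k:r + k + 1, c - k:c + k + 1]
            out[r, c] = np.sum(window * kernel)
    return out

# 3x3 Gaussian-like blur kernel (weights sum to 1, so brightness is preserved)
blur = np.array([[1, 2, 1],
                 [2, 4, 2],
                 [1, 2, 1]], dtype=float) / 16.0

gray = np.random.default_rng(1).integers(0, 256, size=(8, 8)).astype(float)
smooth = convolve_gray(gray, blur)
```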

Theory on CNN

A CNN, also called a convolutional neural network, is quite different from a BP network: the latter can only receive vectors as input, while the former can receive matrices as input. Moreover, a CNN has unique layer types that a BP network does not, such as the convolution layer, the pooling layer, and the flatten layer.

Convolution Layer

Convolutional layers are normally applied to bidimensional inputs (even though they can be used for vectors and 3D matrices), and they became particularly famous thanks to their extraordinary performance in image classification tasks. They are based on the convolution operation of a fixed-size image kernel with a bidimensional input (which can be the output of another convolutional layer).
For example, a convolution layer right after the input layer often receives three matrices (obtained from a 3-channel image such as BGR) as input signals. Suppose there are $N$ neurons in this layer; then each neuron contains three image kernels as parameters of the model. Hence, there are $N \times 3$ parameter matrices in this layer. In general, the number of image kernels of a neuron in a layer equals the number of input matrices coming from the previous layer.
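As a sanity check on the parameter count, here is a minimal sketch (the numbers are illustrative):

```python
import numpy as np

# N neurons, each holding one 3x3 kernel per input channel
# (3 channels for a BGR image), plus one bias per neuron.
N, channels, k = 8, 3, 3
kernels = np.random.default_rng(2).normal(size=(N, channels, k, k))
biases = np.zeros(N)
print(kernels.shape)  # (8, 3, 3, 3): N x 3 parameter matrices, as stated above
```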

Padding to Perform Convolution Successfully

When receiving matrices from the previous layer, each neuron applies its image kernels to the input matrices. The procedure is a little different from the discussion above: in this case, each kernel can be seen as a moving window over the input matrices, and we apply the convolution to the kernel and the sub-matrix below it with a fixed stride. But this causes a problem once the stride is taken into account.
As the figure below shows, suppose the stride is 2 and the image kernel is 2×2. We can see that one column of pixels still remains uncovered. In this case, we often add some zero columns and rows to the original matrix so that the "moving" convolution operation can continue. In general, this process is called zero padding.
[Figure: zero padding of the input matrix]
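A minimal sketch of zero padding with NumPy, matching the figure's setup (5×5 input, 2×2 kernel, stride 2; only one extra row and column are needed):

```python
import numpy as np

x = np.arange(25, dtype=float).reshape(5, 5)
# With a 2x2 kernel and stride 2, the window positions 0, 2, 4 need one extra
# column on the right and one extra row at the bottom to cover the matrix.
padded = np.pad(x, ((0, 1), (0, 1)), mode='constant', constant_values=0.0)
print(padded.shape)  # (6, 6)
```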

Convolution Operation of Neurons

Suppose the present layer has two neurons and receives three matrices from the previous layer; hence each neuron correspondingly has three image kernels.
For each position, we sum up the convolution results of kernel and sub-matrix over all the image kernels of a single neuron, and add a bias to the sum; the result is denoted $z_i$ for simplicity's sake. We then use the $z_i$ as elements to form a net output matrix.
If you are confused by my description, no problem; please refer to the following GIF:
[Figure: GIF of two neurons convolving three 5×5 input matrices with 3×3 kernels at stride 2]
As we can see, the outputs from the previous layer are three 5×5 matrices, and we define the size of the image kernels as 3×3. Obviously we should deploy zero padding if the stride equals 2, as in the GIF above. Then we move the kernel and operate the convolution of the kernels and the sub-matrices covered by them. At every position we sum up the convolutions of the three kernels together with the bias and save the result to the net output, shown as the green matrices.
From the sample above, we can also observe that the number of output matrices equals the number of neurons in the layer. Therefore, in some articles, the number of neurons in a convolution layer is also referred to as the depth.
Furthermore, like the hidden layers in a BP network, the convolution layers also contain an activation function, which in this case is applied to each element of the net output.
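Putting the pieces together, here is a sketch of one neuron's forward pass matching the GIF's setup (three 5×5 inputs, 3×3 kernels, stride 2, zero padding); the random values are illustrative:

```python
import numpy as np

def conv_single_position(windows, kernels, bias):
    # windows: one sub-matrix per input channel; kernels: one kernel per channel.
    # z = sum over channels of (window * kernel) + bias
    return sum(np.sum(w * k) for w, k in zip(windows, kernels)) + bias

def conv_neuron(inputs, kernels, bias, stride=2):
    """Forward pass of ONE neuron: 3 input matrices -> 1 net-output matrix."""
    k = kernels.shape[-1]
    padded = [np.pad(x, 1) for x in inputs]          # zero padding on all sides
    size = (padded[0].shape[0] - k) // stride + 1
    out = np.zeros((size, size))
    for r in range(size):
        for c in range(size):
            windows = [p[r*stride:r*stride+k, c*stride:c*stride+k] for p in padded]
            out[r, c] = conv_single_position(windows, kernels, bias)
    return np.maximum(out, 0.0)                       # element-wise ReLU activation

rng = np.random.default_rng(3)
inputs = [rng.normal(size=(5, 5)) for _ in range(3)]  # three 5x5 input matrices
kernels = rng.normal(size=(3, 3, 3))                  # three 3x3 kernels, one per input
print(conv_neuron(inputs, kernels, bias=0.1).shape)   # (3, 3) net output
```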

Pooling Layer

The pooling layer can be seen as a layer with only one neuron. Hence, the number of its input matrices equals the number of its output matrices; for example, if it receives 5 matrices from the previous layer, it also outputs 5 matrices.
But what is the function of the pooling layer? It plays a role similar to that of the function cv2.resize(). By resizing the input matrices, the pooling layer compresses the features and performs dimension reduction on the input, which in turn prevents over-fitting to some extent.

What does cv2.resize do?

cv2.resize creates a target image with a target size from the original image. A pixel position in the target image of size $(T_x, T_y)$ is denoted by $(x', y')$, and $(x, y)$ denotes a position in the original image of size $(S_x, S_y)$. Then $(x, y)$ is derived using the following equation:
$x = x' \cdot \frac{S_x}{T_x}, \qquad y = y' \cdot \frac{S_y}{T_y}$
After finding the corresponding position, we can calculate the pixel at $(x', y')$ from the neighboring pixels of $(x, y)$ using some mathematical operation such as NEAREST, LINEAR, or CUBIC interpolation.
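A typical usage sketch (the file name face.png is a hypothetical placeholder):

```python
import cv2

img = cv2.imread('face.png', cv2.IMREAD_GRAYSCALE)  # hypothetical file name
# Shrink to a fixed target size; INTER_NEAREST / INTER_LINEAR / INTER_CUBIC
# choose how the neighbors of (x, y) are combined.
small = cv2.resize(img, (100, 100), interpolation=cv2.INTER_LINEAR)
```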

Resize Method of Pooling Layer

Quite different from cv2.resize, the pooling layer calculates the pixel at $(x', y')$ by taking the maximum or the average of the neighbors of $(x, y)$, for example:
[Figure: max pooling and average pooling examples]
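A minimal NumPy sketch of non-overlapping 2×2 pooling in both modes:

```python
import numpy as np

def pool2x2(x, mode='max'):
    """Non-overlapping 2x2 pooling: each output pixel summarizes a 2x2 block."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    blocks = x[:h*2, :w*2].reshape(h, 2, w, 2)
    return blocks.max(axis=(1, 3)) if mode == 'max' else blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [4, 2, 3, 1],
              [6, 8, 7, 5]], dtype=float)
print(pool2x2(x, 'max'))    # [[7. 8.] [8. 7.]]
print(pool2x2(x, 'avg'))    # [[4. 5.] [5. 4.]]
```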

Flatten Layer

Quite similar to the pooling layer, the flatten layer can be seen as a single-neuron layer. Its role is to transform matrices into a single vector of large length, much like the function .flatten() in Python. The role of the flatten layer is as follows:
[Figure: the flatten layer turning matrices into one vector]
Using the flatten layer, we transform each input matrix into a vector, then concatenate all those vectors into one single vector as the output.
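In NumPy this amounts to:

```python
import numpy as np

# Three 2x2 feature maps -> one vector of length 12.
maps = [np.array([[1, 2], [3, 4]]),
        np.array([[5, 6], [7, 8]]),
        np.array([[9, 10], [11, 12]])]
vector = np.concatenate([m.flatten() for m in maps])  # row-by-row expansion
print(vector)  # [ 1  2  3  4  5  6  7  8  9 10 11 12]
```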

The Workflow of CNN

Under normal conditions, before the flatten layer a CNN focuses on convolution layers, with pooling layers positioned between consecutive convolution layers. After the flatten layer, the CNN acts like a BP network, with a hidden layer and an output layer.
If we follow the output of each layer, we can draw the following chart:
[Figure: the layer-by-layer workflow of a CNN]
When a matrix (or a gray image) is delivered to the first convolution layer, it is transformed into $N$ matrices, where $N$ is the number of neurons, each holding one image kernel as its parameters. The matrices obtained from the convolution layer (possibly passed through an activation function) are then delivered to the pooling layer, where each matrix is resized to prevent over-fitting and reduce complexity. After pooling, the outputs are delivered to another convolution layer, and the same procedure as discussed above repeats until the flatten layer. The flatten layer transforms all the matrices into a single vector by expanding each matrix row by row. Finally, this vector is delivered to a sub-network (usually a BP network with a single hidden layer) as its input.
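Here is a minimal sketch of this workflow in Keras; the layer sizes and the 28×28 input are illustrative assumptions, not values prescribed above:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(8, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),            # pooling between convolution layers
    layers.Conv2D(16, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                       # matrices -> one long vector
    layers.Dense(32, activation='relu'),    # hidden layer of the BP sub-network
    layers.Dense(10, activation='softmax')  # output layer: class probabilities
])
model.summary()
```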

The Construction of CNN

So how do we construct a CNN? Given a dataset, we can decide the number of neurons in the input layer and the output layer. What we still need to decide is the number of neurons in the convolution layers and in the hidden layer of the sub-network. In practice, these numbers should be chosen according to the effectiveness of the model, which can be measured using cross-validation. However, the time cost often prevents us from doing so. In general, we can also select the number for each layer according to some rule of thumb.
We use $n'$ to denote the number of neurons in a layer, and $n$ that of the previous layer; then $n'$ can be derived from the following equations:
[Figure: rule-of-thumb equations relating $n'$ to $n$]

Training the Model and Dropout Regularization

Similar to the BP network, we train the model by the backward propagation algorithm; that is to say, we calculate the error gradually from the output back to the input and adjust the parameters of the neurons at the same time. The parameters to be trained are the elements of the image kernels in the convolution neurons and the coefficients of the activation functions.
In regression problems we often choose MSE as the loss function (also called the cost function or risk function in some papers), and cross-entropy in classification problems. To find the parameters, namely to work out the optimization problem of minimizing the loss function, we often choose a stochastic search algorithm with mini-batches, such as SGD, RMSprop, or Adam.
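For reference, these optimizers are available out of the box in Keras (the learning rates here are common defaults, not tuned values):

```python
from tensorflow.keras import optimizers

# Mini-batch stochastic optimizers mentioned above; pick one when compiling.
sgd = optimizers.SGD(learning_rate=0.01)
rmsprop = optimizers.RMSprop(learning_rate=0.001)
adam = optimizers.Adam(learning_rate=0.001)
# model.compile(optimizer=adam, loss='categorical_crossentropy')  # classification
# model.compile(optimizer=adam, loss='mse')                       # regression
```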
Like linear regression, a CNN also adopts some regularization strategies to prevent over-fitting. Even though there are pooling layers, over-fitting sometimes still occurs because of the complexity of the model. So in practice, one often employs dropout strategies to drop some neurons during the training process.
For example, during each epoch of the training process, we randomly choose part of the neurons and ignore them:
[Figure: dropout of whole neurons during training]
Moreover, since dropping out whole neurons is quite a large change, we can also drop out by links:
[Figure: dropout of individual links]
Again, we should remember that dropout is just a regularization method for alleviating over-fitting, and it applies only during training. When using the trained model for prediction, one MUST NOT drop out any neurons or links.
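In Keras, a Dropout layer implements this behavior and is automatically disabled at prediction time (the layer sizes below are illustrative):

```python
from tensorflow.keras import layers, models

# rate=0.5 ignores a random half of the neurons at each training step;
# during prediction the layer passes values through unchanged.
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(100,)),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])
```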

Further Topics

1. RAdam is a state-of-the-art method for solving the optimization problem, published by Chinese students, and it has been widely adopted.
2. We often choose ReLU as the activation function, which has become very popular in the last few years.
3. Visualizing the role of the convolution layer. We start from the following image:
[Figure: the original input image]
We then create a convolution layer with three neurons:
[Figure: code for the three-neuron convolution layer and the three output feature maps]
Then we know the stride is 1, using Keras's built-in function.
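Since the original code lives in the screenshots above, here is a hedged reconstruction of the experiment (the file name and the plotting details are assumptions):

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models

img = cv2.imread('face.png', cv2.IMREAD_GRAYSCALE).astype('float32') / 255.0
x = img[np.newaxis, :, :, np.newaxis]             # batch of one 1-channel image

conv = models.Sequential([layers.Conv2D(3, (3, 3), input_shape=img.shape + (1,))])
print(conv.layers[0].strides)                      # (1, 1): the default stride

maps = conv.predict(x)[0]                          # untrained, random kernels
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.imshow(maps[:, :, i], cmap='gray')         # one feature map per neuron
plt.show()
```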
What will happen if we change the size of the kernel from 3×3 to 10×10? In the same way, we visualize the output image (matrix):
[Figure: output feature maps with 10×10 kernels]
Then we see that the larger the kernel, the less information in the output.
Next, what will happen if we use just one neuron with a 3×3 kernel (instead of the three neurons of the last example)?
[Figure: output feature map of a single neuron with a 3×3 kernel]
We see that the output is more informative for us humans, but harder for the machine to "read" instead.
4. What is the role of the pooling layer?
Continuing the last example, we add a pooling layer after a convolution layer without activation:
[Figure: feature maps after the pooling layer]
We can see that the size of the input image is reduced.
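A sketch of this last experiment; the 64×64 input size is an assumption:

```python
from tensorflow.keras import layers, models

# A 2x2 pooling layer after a convolution layer with no activation
# roughly halves each spatial dimension of the feature maps.
model = models.Sequential([
    layers.Conv2D(3, (3, 3), activation=None, input_shape=(64, 64, 1)),
    layers.MaxPooling2D((2, 2)),
])
model.summary()  # conv output (62, 62, 3) -> pooled output (31, 31, 3)
```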