**1. AlexNet**

paper: 2012_ImageNet_ImageNet Classification with Deep Convolutional Neural Networks

characteristics: parallel structure

This paper introduces AlexNet. The authors note that modifying the structure even slightly degrades accuracy.

Some tricks:

- Preprocessing is minimal: only downsampling images to 256 x 256 and subtracting the mean over the training set
- ReLU, a non-saturating activation whose shape resembles softplus; very efficient, reaching the same training error about 6 times faster than tanh
- Training on two GPUs: each GPU holds half of the kernels; the net has 5 convolutional layers and 3 FC layers, and the GPUs communicate only at the 3rd convolutional layer and the FC layers
- Overlapping pooling
- Reducing overfitting:
    - Data augmentation:
        - extracting random 224 x 224 patches from the 256 x 256 images
        - using PCA on the RGB pixel values to alter the intensities of the principal components
    - Dropout
- Training details:
    - SGD with weight decay and momentum; the learning rate is reduced gradually
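A quick pure-Python sketch (illustrative values, not from the paper) of why the non-saturating ReLU trains faster than tanh:

```python
import math

def relu(x):
    # Non-saturating: gradient is 1 for any positive input.
    return max(0.0, x)

def softplus(x):
    # The smooth curve that ReLU resembles: log(1 + e^x).
    return math.log1p(math.exp(x))

# tanh saturates for large |x|, so its gradient (1 - tanh^2) vanishes,
# while ReLU's gradient stays at 1 -- one reason training converges faster.
tanh_grad = 1.0 - math.tanh(5.0) ** 2   # nearly zero
relu_grad = 1.0 if 5.0 > 0 else 0.0     # exactly 1
```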

Others:

- Parameter count: 60 million
- Where the two GPU streams merge: the feature maps are concatenated first, then convolved

**2. GoogLeNet**

paper: 2014_ImageNet_Going deeper with convolutions

Some tricks:

- Network in network
- Inception
- Inception approximates the optimal local sparse structure of a convolutional vision network and repeats it spatially
- 3 variants (incept5, 6, 7) give different results
- 1 x 1 convolutions are used to compute reductions before the expensive 3 x 3 and 5 x 5 convolutions
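The saving from the 1 x 1 reductions is easy to verify with the conv parameter formula; the channel counts below are illustrative assumptions, not the paper's exact numbers:

```python
# Parameters of a conv layer = in_channels * out_channels * kH * kW (biases ignored).
c_in, c_out = 192, 128   # hypothetical input/output channels
c_red = 32               # hypothetical 1x1 "reduction" channels

direct_3x3 = c_in * c_out * 3 * 3                              # 221,184 params
with_reduction = c_in * c_red * 1 * 1 + c_red * c_out * 3 * 3  # 43,008 params

print(direct_3x3 // with_reduction)  # prints 5: roughly a 5x saving
```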

Others:

- most of the progress is not just the result of more powerful hardware, large datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.
- the current leading approach for object detection is Regions with Convolutional Neural Networks (R-CNN), which:
    - first utilizes low-level cues such as color and superpixel consistency to generate potential object proposals
    - then uses CNN classifiers to identify object categories at those locations

- how to use zero padding
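One common use is "same" padding, where the pad width is chosen so a convolution preserves spatial size; a minimal sketch of the output-size arithmetic:

```python
def conv_out_size(n, k, stride=1, pad=0):
    # Standard conv output formula: floor((n + 2*pad - k) / stride) + 1.
    return (n + 2 * pad - k) // stride + 1

# Without padding a 3x3 filter shrinks the map; pad=1 keeps the size intact.
print(conv_out_size(224, 3))         # prints 222
print(conv_out_size(224, 3, pad=1))  # prints 224
```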

**3. VGG 16 and VGG 19**

paper: 2015_ImageNet_Very Deep Convolutional Networks For Large-scale Image Recognition

Some tricks:

- Addresses an important aspect of ConvNet architecture design: its depth. The depth of the network is steadily increased by adding more convolutional layers
- the architecture is feasible because of very small convolutional filters in all layers: 3 x 3
- the number of filters gradually increases from beginning to end
- uses ZeroPadding, Convolutional, MaxPooling, Flatten, Dense, and Dropout layers
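Why very small filters make the depth affordable: two stacked 3 x 3 layers cover the same 5 x 5 receptive field as one 5 x 5 layer but with fewer parameters (the channel count below is an illustrative assumption):

```python
c = 64  # assume equal input and output channels

# Two stacked 3x3 convs: 5x5 effective receptive field, two non-linearities.
two_3x3 = 2 * (c * c * 3 * 3)  # 73,728 params

# One 5x5 conv: same receptive field, one non-linearity, more params.
one_5x5 = c * c * 5 * 5        # 102,400 params

assert two_3x3 < one_5x5
```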

Others:

- 1 x 1 convolutions can be seen as a linear transformation of the input channels (followed by a non-linearity)
- weight counts:
    - VGG_16: 138M
    - VGG_19: 144M

- the first dense layer after flattening contains about 100M parameters
- based on the available C++ Caffe toolbox
- Dropout usually after flatten
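The ~100M figure checks out: in VGG-16 the last pooling output is 7 x 7 x 512, which is flattened and fed into a 4096-unit dense layer:

```python
# First FC layer of VGG-16: flattened 7x7x512 feature map -> 4096 units.
flat = 7 * 7 * 512           # 25,088 inputs
fc1_weights = flat * 4096    # weight count, biases excluded

print(fc1_weights)  # prints 102760448, i.e. ~100M
```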

**4. SqueezeNet**

paper: 2016_arXiv_SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

Goal: to identify a model architecture that has very few parameters while preserving accuracy.

Characteristics:

- high accuracy: similar to AlexNet
- 50x fewer parameters
- <0.5MB model size
- no fully-connected layers
- final layers:
    - Conv10: 13 x 13 x 1000
    - AveragePooling: 1 x 1 x 1000, i.e. one score for each of the 1000 classes
- Fire modules, overlapping pooling

Some tricks:

- Whereas GoogLeNet uses 3 x 3 filters, SqueezeNet replaces most 3 x 3 filters with 1 x 1 filters, since a 1 x 1 filter has 9x fewer parameters than a 3 x 3 one
- Decrease the number of input channels to the 3 x 3 filters:
    - the total number of parameters in a conv layer is (number of input channels) x (number of filters) x (filter size) x (filter size)
    - so decreasing the number of input channels is another way to cut parameters
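A quick worked example of the formula with made-up channel counts, showing how shrinking the input channels to a 3 x 3 layer (the squeeze layer's job) cuts its parameters:

```python
def conv_params(in_ch, n_filters, k):
    # (number of input channels) x (number of filters) x (filter size)^2
    return in_ch * n_filters * k * k

before = conv_params(128, 64, 3)  # 73,728 params with 128 input channels
after = conv_params(16, 64, 3)    # 9,216 params after squeezing to 16 channels

print(before // after)  # prints 8: 8x fewer parameters
```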

- Downsample late in the network so that convolutional layers have large activation maps:
    - if early layers have large strides, most layers will have small activation maps
    - large activation maps can lead to higher classification accuracy

- The Fire module is comprised of:
    - a squeeze convolutional layer (which has only 1 x 1 filters)
    - feeding into an expand layer that has a mix of 1 x 1 and 3 x 3 convolutional filters
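A minimal sketch of a Fire module's parameter count, using the paper's s1x1/e1x1/e3x3 notation (biases excluded; the example sizes match the paper's fire2 module):

```python
def fire_params(in_ch, s1x1, e1x1, e3x3):
    # squeeze: 1x1 convs only, then expand: a mix of 1x1 and 3x3 convs.
    squeeze = in_ch * s1x1 * 1 * 1
    expand = s1x1 * e1x1 * 1 * 1 + s1x1 * e3x3 * 3 * 3
    return squeeze + expand

# fire2: 96 input channels, s1x1 = 16, e1x1 = e3x3 = 64.
print(fire_params(96, 16, 64, 64))  # prints 11776
```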

- the structure:
    - begins with a standalone convolutional layer
    - followed by 8 Fire modules
    - the number of filters per Fire module gradually increases from beginning to end
    - max pooling (overlapping, stride 2) only after conv1, fire4, fire8, and conv10
    - final average pooling
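The final average pooling is just a per-channel mean over the 13 x 13 grid, turning each of the 1000 class maps into a single score; a tiny pure-Python sketch with a reduced channel count:

```python
H, W, C = 13, 13, 4  # 4 channels instead of 1000 to keep the example small

# Channel c holds the constant value c, so its average must come back as c.
maps = [[[float(c)] * W for _ in range(H)] for c in range(C)]
scores = [sum(sum(row) for row in maps[c]) / (H * W) for c in range(C)]

print(scores)  # prints [0.0, 1.0, 2.0, 3.0]
```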

Others:

- model compression: a sensible approach is to take an existing CNN model and compress it in a lossy fashion
- SVD can also be applied to compress a model
- based on the C++ Caffe framework; Caffe cannot support a convolutional layer that contains multiple filter resolutions (e.g. 1 x 1 and 3 x 3), so to make it work the 1 x 1 filters are padded with zeros, whereas Keras supports this directly
- network pruning: which begins with a pretrained model, then replaces parameters that are below a certain threshold with zeros to form a sparse matrix, and finally performs a few iterations of training on the sparse CNN
- with the pruning idea, dense-sparse-dense (DSD) training yields higher accuracy (a 4.3% top-1 improvement):
    - on top of the sparse SqueezeNet (pruned 3x), let the killed weights recover, re-initialising them from zero
    - let the surviving weights keep their values
    - retrain the whole network
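A minimal sketch of the prune-then-recover idea in pure Python (the threshold and weight values are made up; real pruning operates on trained weight tensors):

```python
def prune(weights, threshold):
    # Dense -> sparse: zero out small-magnitude weights.
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def recover(pruned):
    # Sparse -> dense: surviving weights keep their values; the killed
    # ones restart from zero and are then retrained with the rest of the net.
    return list(pruned)

w = [0.8, -0.02, 0.5, 0.01, -0.9]
sparse = prune(w, 0.05)       # [0.8, 0.0, 0.5, 0.0, -0.9]
dense_again = recover(sparse)
```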

**5. ResNet (under construction)**

paper: 2015_ImageNet_Deep Residual Learning for Image Recognition

Characteristics:

- 152 layers, 8x deeper than VGG-19
- uses deep residual learning: identity shortcut connections
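The core idea in one line: a residual block learns F(x) and outputs F(x) + x, so the identity map is trivially representable; a toy element-wise sketch:

```python
def residual_block(x, f):
    # Output = learned residual F(x) plus the identity shortcut x.
    return [fi + xi for fi, xi in zip(f(x), x)]

# If the residual function collapses to zero, the block is an identity map --
# which is why stacking extra blocks should not, in principle, hurt accuracy.
zero_residual = lambda x: [0.0] * len(x)
print(residual_block([1.0, 2.0, 3.0], zero_residual))  # prints [1.0, 2.0, 3.0]
```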