Different CNN structures

1. AlexNet

paper: 2012_ImageNet_ImageNet Classification with Deep Convolutional Neural Networks

characteristics: parallel structure

This is the paper that introduced AlexNet. The authors report that changing the structure even slightly (for example, removing a single convolutional layer) lowers accuracy.

Some tricks:

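  1. Minimal preprocessing: down-sample each image to 256 x 256 and subtract the training-set mean from each pixel
  2. ReLU, a non-saturating activation whose shape is similar to softplus; it is cheap to compute, and in the paper's CIFAR-10 experiment a ReLU network reached 25% training error about 6 times faster than the same network with tanh
  3. Training on multiple GPUs: the network has 5 convolutional layers and 3 FC layers, each of the two GPUs holds half of the kernels, and the GPUs communicate only at the 3rd convolutional layer and the FC layers
  4. Overlapping pooling
  5. Reducing overfitting
    1. Data augmentation:
      1. extracting random 224 x 224 patches from the 256 x 256 images
      2. using PCA on the RGB pixel values to vary the intensity of the principal color components
    2. Dropout
  6. SGD with momentum and weight decay, with the learning rate reduced gradually during training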
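
The preprocessing and random-crop augmentation from the list above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the image is a toy single-channel list of rows, the mean is a single scalar rather than a per-pixel training-set mean, and the names `subtract_mean` and `random_crop` are made up for this example.

```python
import random

def subtract_mean(image, mean):
    """Center pixel values by subtracting a scalar mean (a simplification)."""
    return [[pixel - mean for pixel in row] for row in image]

def random_crop(image, crop_size, rng):
    """Return a random crop_size x crop_size patch from a 2-D image (list of rows)."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)   # randint is inclusive at both ends
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]

rng = random.Random(0)
# Toy 256 x 256 single-channel image standing in for a down-sampled photo.
image = [[(r * 256 + c) % 255 for c in range(256)] for r in range(256)]
mean = sum(map(sum, image)) / (256 * 256)

patch = random_crop(subtract_mean(image, mean), 224, rng)
print(len(patch), len(patch[0]))  # 224 224
```

Each call draws a different 224 x 224 window, so the network rarely sees exactly the same input twice.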


  1. Parameter count: about 60 million
  2. Where the parallel streams merge: the outputs are merged first and then convolved
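
The 60 million figure can be checked with a rough count. The layer sizes below are taken from the AlexNet paper as I understand them, and the two-GPU split is modeled as `groups=2` on conv2, conv4 and conv5 (each kernel sees only half of the input channels); treat both as assumptions of this sketch.

```python
def conv_params(in_channels, out_channels, kernel, groups=1):
    """Weights plus biases of a (possibly grouped) convolution layer."""
    return (in_channels // groups) * out_channels * kernel * kernel + out_channels

def fc_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

layers = {
    "conv1": conv_params(3, 96, 11),
    "conv2": conv_params(96, 256, 5, groups=2),   # split across the two GPUs
    "conv3": conv_params(256, 384, 3),            # sees both GPUs' outputs
    "conv4": conv_params(384, 384, 3, groups=2),
    "conv5": conv_params(384, 256, 3, groups=2),
    "fc6": fc_params(6 * 6 * 256, 4096),
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(4096, 1000),
}

total = sum(layers.values())
print(total)  # 60965224
```

Most of the parameters sit in the three FC layers; the five convolutional layers together account for only about 2.3 million.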

2. GoogLeNet

paper: 2014_ImageNet_Going deeper with convolutions

Some tricks:

  1. Network in network
  2. Inception
  3. The Inception module approximates the optimal local sparse structure of a convolutional vision network using readily available dense components, and is repeated spatially
  4. 3 options, incept5,6,7, have different results
  5. 1 x 1 convolutions are used to compute reductions before the expensive 3 x 3 and 5 x 5 convolutions
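
To see why the 1 x 1 reduction matters, compare the weight count of a 5 x 5 convolution applied directly with the same convolution applied after a 1 x 1 reduction. The channel counts (192 in, reduced to 16, 32 out) are illustrative values chosen for this sketch, and biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel):
    """Number of weights in a convolution layer (biases ignored)."""
    return in_channels * out_channels * kernel * kernel

in_ch, reduce_ch, out_ch = 192, 16, 32  # illustrative channel counts

direct = conv_weights(in_ch, out_ch, 5)
reduced = conv_weights(in_ch, reduce_ch, 1) + conv_weights(reduce_ch, out_ch, 5)

print(direct, reduced)  # 153600 15872
```

With these numbers the reduced branch needs roughly a tenth of the weights, and the multiply-adds per output position shrink by the same factor.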


  1. most of the progress is not just the result of more powerful hardware, large datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.
  2. the leading approach for object detection at the time of the paper is Regions with Convolutional Neural Networks (R-CNN), which works in two stages:
    1. first use low-level cues such as color and superpixel consistency to generate potential object proposals
    2. then use CNN classifiers to identify object categories at those locations
  3. how to use zero padding
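
On zero padding: the parallel branches of an Inception module can only be concatenated if they all produce the same spatial size, which is what padding provides. A small sketch of the usual output-size formula follows; the 28 x 28 input and the helper names are assumptions for illustration.

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1

def same_padding(kernel):
    """Zero padding that preserves the size for stride 1 and an odd kernel."""
    return (kernel - 1) // 2

for k in (1, 3, 5):
    unpadded = conv_output_size(28, k)
    padded = conv_output_size(28, k, padding=same_padding(k))
    print(k, unpadded, padded)
# 1 28 28
# 3 26 28
# 5 24 28
```

Without padding the 1 x 1, 3 x 3 and 5 x 5 branches would produce 28, 26 and 24 pixel wide maps; with padding of 0, 1 and 2 they all stay at 28.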

3. VGG 16 and VGG 19

paper: 2015_ImageNet_Very Deep Convolutional Networks For Large-scale Image Recognition

Some tricks:

  1. Addresses an important aspect of ConvNet architecture design: its depth. The depth of the network is steadily increased by adding more convolutional layers
  2. This is feasible because very small (3 x 3) convolutional filters are used in all layers
  3. The number of filters gradually increases from the beginning to the end of the network
  4. Layer types used: ZeroPadding, Convolution, MaxPooling, Flatten, Dense, Dropout
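
One argument for small filters is that a stack of three 3 x 3 layers covers the same 7 x 7 receptive field as a single 7 x 7 layer with fewer weights. The sketch below checks that arithmetic, assuming stride 1, the same number of channels C in and out of every layer, and no biases; C = 256 is an arbitrary choice.

```python
def stacked_receptive_field(kernel, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)

def stacked_weights(kernel, num_layers, channels):
    """Weights in num_layers conv layers with `channels` inputs and outputs each."""
    return num_layers * kernel * kernel * channels * channels

C = 256  # arbitrary channel count for illustration

print(stacked_receptive_field(3, 3))  # 7
print(stacked_weights(3, 3, C))       # 1769472  (27 * C^2)
print(stacked_weights(7, 1, C))       # 3211264  (49 * C^2)
```

The stacked version also applies three non-linearities instead of one.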


  1. a 1 x 1 convolution can be seen as a linear transformation of the input channels (followed by a non-linearity)
  2. number of weights:
    1. VGG_16: 138M
    2. VGG_19: 144M
  3. the first fully connected layer after the flatten contains about 100M parameters
  4. implemented with the publicly available C++ Caffe toolbox
  5. Dropout usually comes after the flatten, in the fully connected layers
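
The weight counts above can be reproduced with a short script. The layer configurations are the standard 16- and 19-layer VGG variants as I understand them (3 x 3 convolutions, 2 x 2 max-pooling marked `"M"`, a 224 x 224 RGB input, and FC layers of 4096, 4096 and 1000 units); the function name `count_params` is made up for this sketch.

```python
VGG16 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]
VGG19 = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
         512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

def count_params(config, input_size=224, in_channels=3, fc_sizes=(4096, 4096, 1000)):
    """Return (total parameters, parameters of the first FC layer), biases included."""
    total, size, channels = 0, input_size, in_channels
    for item in config:
        if item == "M":
            size //= 2                                   # 2 x 2 max-pooling halves the map
        else:
            total += channels * item * 3 * 3 + item     # 3 x 3 conv weights + biases
            channels = item
    features = size * size * channels                    # length of the flattened vector
    first_fc = None
    for out_features in fc_sizes:
        layer = features * out_features + out_features
        if first_fc is None:
            first_fc = layer
        total += layer
        features = out_features
    return total, first_fc

print(count_params(VGG16))  # (138357544, 102764544)
print(count_params(VGG19))  # (143667240, 102764544)
```

The first FC layer alone holds about 103M of the 138M parameters, since it connects a 7 x 7 x 512 flattened map to 4096 units.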

4. SqueezeNet

paper: 2016_arXiv_SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

Goal: to identify a model that has very few parameters while preserving accuracy


  1. high accuracy: similar to AlexNet
  2. 50x fewer parameters
  3. <0.5MB model size
  4. no fully-connected layers
  5. final layers:
    1. conv10: 13 x 13 x 1000
    2. AveragePooling: 1 x 1 x 1000, i.e. one score for each of the 1000 classes
  6. Fire modules, overlapping pooling

Some tricks:

  1. Where networks such as GoogLeNet use 3 x 3 filters, SqueezeNet replaces the majority of them with 1 x 1 filters, since a 1 x 1 filter has 9x fewer parameters than a 3 x 3 filter
  2. Decrease the number of input channels to 3 x 3 filters
    1. the total number of parameters in the layer is (number of input channels) x (number of filters) x (filter size) x (filter size)
    2. so reducing the number of input channels is one way to reduce the parameters
  3. Downsample late in the network so that convolutional layers have large activation maps
    1. If early layers in the network have large strides, then most layers will have small activation maps.
    2. Large activation maps can lead to higher classification accuracy.
  4. The Fire module is comprised of:
    1. a squeeze convolution layer (which has only 1 x 1 filters),
    2. feeding into an expand layer that has a mix of 1 x 1 and 3 x 3 convolution filters
  5. the structure:
    1. begin with a standalone convolutional layer
    2. followed by 8 fire modules
    3. gradually increase the number of filters per fire module from the beginning to the end.
    4. max-pooling (overlapping pooling) with stride 2 only after conv1, fire4, fire8, and conv10
    5. final average pooling
    6. final layers:
      1. conv10: 13 x 13 x 1000
      2. AveragePooling: 1 x 1 x 1000, i.e. one score for each of the 1000 classes
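
The parameter formula and the Fire module description above can be combined into a small calculator. The example numbers (96 input channels, 16 squeeze filters, 64 + 64 expand filters) are what I believe the paper uses for its first Fire module; treat them, and the function names, as assumptions of this sketch.

```python
def fire_params(in_channels, squeeze, expand1x1, expand3x3, bias=True):
    """Parameters of a Fire module: 1x1 squeeze, then 1x1 and 3x3 expand filters."""
    squeeze_w = in_channels * squeeze * 1 * 1
    expand1_w = squeeze * expand1x1 * 1 * 1      # expand filters see only the squeezed channels
    expand3_w = squeeze * expand3x3 * 3 * 3
    biases = squeeze + expand1x1 + expand3x3 if bias else 0
    return squeeze_w + expand1_w + expand3_w + biases

def conv3x3_params(in_channels, filters, bias=True):
    """Parameters of a plain 3x3 convolution with the same input and output width."""
    return in_channels * filters * 3 * 3 + (filters if bias else 0)

print(fire_params(96, 16, 64, 64))  # 11920
print(conv3x3_params(96, 128))      # 110720
```

Both layers map 96 channels to 128 channels, but the Fire module needs about 9x fewer parameters because the 3 x 3 filters only see 16 input channels and half of the output filters are 1 x 1.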


  1. model compression: a sensible approach is to take an existing CNN model and compress it in a lossy fashion
  2. SVD can also be used to compress a model
  3. The original implementation is based on the Caffe framework. Caffe does not natively support a convolutional layer that contains multiple filter resolutions (e.g. 1 x 1 and 3 x 3), so the expand layer is implemented as two separate convolution layers whose outputs are concatenated, with 1 pixel of zero padding on the input to the 3 x 3 filters so that both outputs have the same height and width; the same two-branch structure can also be built in Keras
  4. Network pruning: begins with a pretrained model, then replaces parameters that are below a certain threshold with zeros to form a sparse matrix, and finally performs a few iterations of training on the sparse CNN
  5. Building on the pruning idea, dense-sparse-dense (DSD) training yields higher accuracy: a 4.3% improvement in top-1 accuracy
    1. starting from the sparse SqueezeNet (pruned 3x), let the pruned weights recover, initializing them from zero
    2. let the surviving weights keep their values
    3. retrain the whole network
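
The pruning and DSD steps above can be sketched on a toy weight vector. This is not the authors' implementation: the "network" is five scalar weights, `toy_grads` is a hypothetical stand-in for backpropagation (the gradient of a simple squared error toward made-up targets), and the threshold and learning rate are arbitrary.

```python
def magnitude_mask(weights, threshold):
    """1 keeps a weight, 0 prunes it (magnitude below the threshold)."""
    return [1 if abs(w) >= threshold else 0 for w in weights]

def apply_mask(weights, mask):
    return [w * m for w, m in zip(weights, mask)]

def sgd_step(weights, grads, lr, mask=None):
    """One SGD step; with a mask, pruned weights are held at zero."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return apply_mask(updated, mask) if mask is not None else updated

def toy_grads(weights, targets):
    """Gradient of 0.5 * sum((w - t)^2); hypothetical stand-in for backprop."""
    return [w - t for w, t in zip(weights, targets)]

targets = [0.9, -0.05, 0.4, 0.02, -0.7]    # made-up optimum
weights = [0.8, -0.1, 0.5, 0.05, -0.6]     # pretend these come from dense pretraining

# Sparse phase: prune small weights, then retrain with the mask applied.
mask = magnitude_mask(weights, threshold=0.2)
weights = apply_mask(weights, mask)
for _ in range(20):
    weights = sgd_step(weights, toy_grads(weights, targets), lr=0.1, mask=mask)
sparse_weights = list(weights)

# Dense phase: drop the mask. Pruned weights restart from zero,
# surviving weights keep their values, and everything is retrained.
for _ in range(20):
    weights = sgd_step(weights, toy_grads(weights, targets), lr=0.1)

print(mask)  # [1, 0, 1, 0, 1]
```

After the sparse phase the two pruned entries are exactly zero; after the dense phase they have moved away from zero again while the surviving weights continued from where the sparse phase left them.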

5. ResNet (under construction)

paper: 2015_ImageNet_Deep Residual Learning for Image Recognition


  1. 152 layers, 8 times deeper than VGG19
  2. uses deep residual learning
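
The core idea of residual learning is that a block outputs F(x) + x instead of F(x), so the layers only have to learn the residual. A minimal sketch on plain lists, assuming an identity shortcut with matching shapes; `residual_block` and `zero_residual` are names made up for this example.

```python
def residual_block(x, residual_fn):
    """Compute y = F(x) + x with an identity shortcut (shapes must match)."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

def zero_residual(x):
    """A residual branch that has learned nothing: F(x) = 0."""
    return [0.0 for _ in x]

print(residual_block([1.0, -2.0, 3.0], zero_residual))  # [1.0, -2.0, 3.0]
```

When the residual branch outputs zero the block is an identity mapping, which is why adding more such blocks should not make a network harder to fit than its shallower counterpart.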





